READING POLICIES FOR JOINS: AN ASYMPTOTIC ANALYSIS

By Ralph P. Russo and Nariankadu D. Shyamalkumar 1 

University of Iowa 

Suppose that $m_n$ observations are made from the distribution R
and $n - m_n$ from the distribution S. Associate with each pair, x from
R and y from S, a nonnegative score $\phi(x,y)$. An optimal reading
policy is one that yields a sequence $m_n$ that maximizes $E(M(n))$, the
expected sum of the $(n - m_n)m_n$ observed scores, uniformly in n.
The alternating policy, which switches between the two sources, is
the optimal nonadaptive policy. In contrast, the greedy policy, which
chooses its source to maximize the expected gain on the next step, is
shown to be the optimal policy. Asymptotics are provided for the case
where the R and S distributions are discrete and $\phi(x,y) = 1$ or $0$
according as x = y or not (i.e., the observations match). Specifically, an
invariance result is proved which guarantees that for a wide class of
policies, including the alternating and the greedy, the variable $M(n)$
obeys the same CLT and LIL. A more delicate analysis of the sequence
$E(M(n))$ and the sample paths of $M(n)$, for both alternating
and greedy, reveals the slender sense in which the latter policy is
asymptotically superior to the former, as well as a sense of equivalence
of the two and robustness of the former.

1. Introduction. Suppose that samples of size $m_n$ and $n - m_n$ are drawn
from tables R and S in a database, table R containing information (age,
interests, education level, etc.) on a group of single males, and table S the
same information on a group of single females. Associate with each pair of
records, x from R and y from S, a nonnegative score $\phi(x,y)$ whose value
depends on how closely the two records agree. A male and female of similar
age, with common interests and education level, would have a high score (a
value near 1 on a [0,1] scale, e.g.). The goal is to choose $m_n$ to maximize
$E(M(n))$, where $M(n)$ is the sum of the $m_n(n - m_n)$ scores generated by
the n records that have been read. In this way, the expected overall interest
level between the two groups (after n reads) is maximized. Alternatively, R
may contain information on a group of buyers in a marketplace (specifically,
which items each seeks to buy) and S information on a group of sellers
(which items each seeks to sell), the goal then being to maximize the level
of commerce between the groups.

The alternating and myopic policies. Suppose that observations are made 
sequentially and without replacement from each of two sources (popula- 
tions) R and S. An algorithm that sequentially chooses the source for each 
observation is referred to as a reading policy. An optimal reading policy (if
one exists) is one that maximizes $E(M(n))$ uniformly in n. Two reading
policies of interest are the alternating, which alternately samples from R and
S, and the myopic (or greedy), which on each step chooses the source that
maximizes the expected gain $E(M(n)) - E(M(n-1))$ for that step. The
alternating policy is interesting because it is easy to implement and requires 
no knowledge of R or S. Moreover, this policy is optimal in a restricted 
sense (see below). Any policy with a fixed sampling order for which the R 
sample size is always within one of the S sample size is considered an alter- 
nating policy, as all such policies produce the same expected total score at 
all steps. In contrast to the alternating policy, the greedy policy requires a 
complete knowledge of R and S. It is a short term strategy that optimizes 
the expected gain on the next step, with no explicit regard to future gains. 
Note that there may be more than one greedy policy, as occasionally the 
greedy criterion may be ambivalent between R and S. 

In the case of the equijoin, the records x from R and y from S can be
categorized by positive integer values r(x) and s(y) with $\phi(x,y) = 1$ or $0$
according as r(x) = s(y) or not. When $\phi(x,y) = 1$, we say that records x
and y match. Optimality in the case of the equijoin was studied in [16]. When
R or S is finite, it was shown that an optimal policy need not exist, that the
alternating policies are optimal among the restricted class of nonadaptive
policies (those that ignore the information obtained from the samples), and
that any greedy policy dominates (and in most cases is strictly better than)
any alternating policy. That alternating is the optimal nonadaptive policy
(whether or not R and S are both infinite, and for arbitrary $\phi$) is easy
to show, so is stated here without proof.

When R and S are infinite, the problem reduces to i.i.d. sampling from
those distributions. In this case it is shown in [16] that alternating is again
optimal among the nonadaptives, and that greedy is optimal among all reading
policies. In the next section we provide a simpler proof of a much stronger
result; namely, that greedy is the optimal policy under the so-called total
expected discounted reward criterion, for any decreasing discount sequence.
From this it follows that greedy is the optimal policy in the case of
arbitrary $\phi$. This case includes an interesting class of score functions
which satisfy $\phi(x,y) = 1$ or $0$ (like the equijoin), but (unlike the
equijoin) allow
$\phi(x_1,y_1) = \phi(x_2,y_1) = \phi(x_2,y_2) \ne \phi(x_1,y_2)$. Such is the
case when all observations are points in a space and $\phi(x,y) = 1$ or $0$
according as x from R and y from S are within a prescribed distance t of each
other or not.

Our interest in the alternating and greedy policies stems from the above 
optimality properties. Our main focus is on the asymptotic properties of 
the alternating and greedy policies in the i.i.d. (infinite populations) case. 
To simplify the presentation, we confine our attention to the equijoin. A 
preview of the kinds of results we seek is provided in the following example. 
Interestingly, even in this simple scenario, the analysis is not trivial. 

An illustrative example. On each step of a coin tossing experiment suppose
that one may choose either of two coins to toss: one of them fair and the
other two-headed. Let $M_A(n)$ denote the number of matches formed by
the policy that alternates between the coins (starting with the fair) after
a total of n tosses (nth epoch) have been made. We have that
$M_A(n) = (n/2)\,\mathrm{Bin}(n/2; 1/2)$ for n even and
$M_A(n) = [(n-1)/2]\,\mathrm{Bin}((n+1)/2; 1/2)$ for n odd; Bin(m;p) being a
binomial random variable with parameters $m \ge 1$ and $p \in [0,1]$. In
particular, this implies that at the nth epoch the expected number of matches
equals $n^2/8$ or $(n^2-1)/8$ according as n is even or odd and that



It can be easily checked that the following is a member of the class of
greedy policies and therefore optimal: toss the fair coin until heads is
obtained, toss the two-headed coin twice, return to the fair coin and repeat the
cycle. We denote the number of matches at the nth epoch using this greedy
policy by $M_G(n)$.

The derivation of a closed form expression for $E(M_G(n))$ is a bit more
involved than it was for $E(M_A(n))$. A method outlined in Section 4 yields



which in turn implies that $E(M_A(n)) < E(M_G(n)) < E(M_A(n+1))$ for $n > 3$.
Thus, there exists a rather tight link between the expectations of the two
processes.
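The expectations in this example can be checked by exact enumeration. The following is a minimal sketch, not the method of Section 4: it tracks the distribution of a small state (fair heads so far, two-headed tosses so far, two-headed tosses still owed in the current cycle) under the cyclic greedy policy described above. The function names and the state encoding are assumptions made for illustration.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def expected_alternating(n):
    """E(M_A(n)): ceil(n/2) fair tosses and floor(n/2) two-headed tosses."""
    fair, two_headed = (n + 1) // 2, n // 2
    return two_headed * fair * HALF  # each fair toss is heads w.p. 1/2


def expected_greedy(n_max):
    """Return [E(M_G(1)), ..., E(M_G(n_max))] for the cyclic greedy policy."""
    # State: (fair heads, two-headed tosses made, two-headed tosses still owed).
    dist = {(0, 0, 0): Fraction(1)}
    out = []
    for _ in range(n_max):
        nxt = {}
        for (h, s, owed), p in dist.items():
            if owed > 0:  # toss the two-headed coin
                moves = [((h, s + 1, owed - 1), p)]
            else:  # toss the fair coin; a head triggers two two-headed tosses
                moves = [((h + 1, s, 2), p * HALF), ((h, s, 0), p * HALF)]
            for state, q in moves:
                nxt[state] = nxt.get(state, 0) + q
        dist = nxt
        # Every fair head matches every two-headed toss: h * s matches.
        out.append(sum(p * h * s for (h, s, _), p in dist.items()))
    return out
```

For instance, `expected_greedy(5)[2:]` gives the exact values of $E(M_G(n))$ for n = 3, 4, 5, which can be compared with `expected_alternating(n)` and `expected_alternating(n + 1)`.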




where (3 



(1.1) 




(1.2) 
An approach to understanding $M_G(n)$ [and not just $E(M_G(n))$] uses an
embedded renewal counting process $\{N(n)\}_{n\ge1}$, with a renewal occurring
upon the observance of a tail, and an inter-arrival variable $3Z + 1$, where
Z is a unit mean geometric random variable. The relation between $M_G(n)$
and $N(n)$ depends on the state occupied at the nth epoch; the states being
a tail, heads with fair coin, first heads (i.e., not preceded by another) with
2-headed coin and second heads (i.e., preceded by another) with 2-headed
coin. For example, $M_G(n) = (2/9)(n - N(n) + 2)(n - N(n) - 1)$ when the
process has just observed heads with the fair coin, and
$M_G(n) = (2/9)(n - N(n))^2$ when it has just observed a tail. We note that
the state process is a doubly stochastic Markov chain.

The above with the approximation $M_G(n) \approx (2/9)(n - N(n))^2$ and the
CLT for renewal counting processes (see [12], page 62) implies that $M_G(n)$
has the same weak limit as $M_A(n)$. Law of the iterated logarithm results for
$M_G(\cdot)$ and $M_A(\cdot)$ are likewise easily obtainable and again coincide.
Also, the exact expressions relating $N(n)$ and $M_G(n)$ along with the
geometric rate of convergence to stationarity of the state process Markov chain
and the expectation of the residual lifetime (or overshoot) of the renewal
process yield (1.2).

Another phenomenon that we find interesting is the following: when both
policies are driven by the same sequence of fair coin tosses, alternating beats
(produces more matches than) greedy on infinitely many epochs, with
probability one, the optimality of the greedy notwithstanding. The reason for
this is that the coin sequence is obliged to transition (in two steps) from the
state (k heads, k tails) to (k heads, k + 2 tails) infinitely often. And when
it does, we observe that alternating produces k more matches than greedy
upon completion of the (4k + 2)nd step. It can similarly be argued that greedy
beats alternating infinitely often with probability one.
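The count of k extra matches at step 4k + 2 can be checked on a concrete coin sequence. The sketch below drives both policies with the same list of fair tosses (`True` for heads); the helper names are hypothetical, and the greedy policy is the cyclic one from the illustrative example.

```python
def matches_alternating(fair_tosses, n):
    """M_A(n) when the fair coin goes first and both policies share fair_tosses."""
    fair, two_headed = (n + 1) // 2, n // 2
    return sum(fair_tosses[:fair]) * two_headed


def matches_greedy(fair_tosses, n):
    """M_G(n) for the cyclic greedy policy driven by the same fair tosses."""
    heads = two_headed = owed = used = 0
    for _ in range(n):
        if owed:  # finish the two two-headed tosses that follow a head
            two_headed += 1
            owed -= 1
        else:  # next fair toss
            if fair_tosses[used]:
                heads += 1
                owed = 2
            used += 1
    return heads * two_headed
```

With k heads and k tails followed by two tails, greedy has used 3k + (k + 2) = 4k + 2 steps and holds k heads against 2k two-headed tosses, while alternating holds the same k heads against 2k + 1 two-headed tosses.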

1.1. Overview of results. Using a method from dynamic programming,
we prove in Section 2 that the greedy policy is optimal under the total
expected reward criterion for any finite horizon. This result extends our
result in [16] (that greedy is optimal in the i.i.d. equijoin case) to general
score functions $\phi$. For our asymptotic analysis, rather than exploit a
renewal structure (as in the example above), we instead take advantage of an
embedded martingale structure in order to produce a broader range of results.

A key to the weak and strong limiting behavior of $M_G(n)$ is an invariance
result (proved in Section 3) which says that the asymptotic behavior
of $M(n)$ under any policy is governed by the variable $R(n)$, the number
of records read from R from among the first n records read. We prove a
central limit theorem and law of the iterated logarithm for $R_G(n)$, which
yields (by invariance) a common CLT and LIL for $M_A(n)$ and $M_G(n)$. Thus,
an observer working with perfect knowledge of the distributions of R and
S cannot do much better (produce more matches) using the greedy policy
than his counterpart who uses the alternating policy and is ignorant of those
distributions.

In Section 4 we take up the mathematical question of how much better
is greedy than alternating. Specifically, we find an expression for $E(M_G(n))$
that is similar to (1.1), but which contains a low order (linear) error term.
As in the illustrative example, this yields

$$\lim_{n\to\infty} \frac{E(M_A(n) - M_G(n))}{n} < 0$$

and for a finite constant k computable from the distributions of R and S,

$$E(M_A(n)) < E(M_G(n)) < E(M_A(n+k)) \quad \text{for all large } n.$$

The former statement uncovers a measure by which greedy is asymptotically
superior to alternating, while the latter reveals how tightly connected the
two processes are, in terms of their expectations. We next identify the weak
limit of $(M_G(n) - M_A(n))/n^{5/4}$ as a scale mixture of normals, showing that
$M_G(n)$ and $M_A(n)$ differ by a higher order (than linear) stochastic term
which is symmetric about zero. Finally, we present a crude LIL type result for
$M_G(n) - M_A(n)$ which shows that although $E(M_A(n))$ and $E(M_G(n))$ are
tightly linked via the above inequality, it takes an arbitrarily large number
of epochs (infinitely often, with probability one) for the sample path of one
process to catch up with that of the other.



1.2. Notation. All vectors carry a tilde. For two vectors, $\tilde x$ and
$\tilde y$, with the same dimensions, $\tilde x \cdot \tilde y$ will denote their
inner product. Almost sure convergence, convergence in probability and weak
convergence will be denoted by $\xrightarrow{a.s.}$, $\xrightarrow{p}$ and
$\xrightarrow{d}$, respectively. We denote the iterated logarithm by $\log_2$,
that is, $\log_2(n) = \log(\log(n))$.

2. Preliminaries. We consider two sources R and S, both containing
infinitely many records. A record from either source, R or S, carries
a single positive integer valued label. The probability that a record from
the R source (resp., the S source) carries the ith label is $r_i$ (resp., $s_i$).
The probability vectors $(r_1, r_2, \ldots)$ and $(s_1, s_2, \ldots)$ are denoted
by $\tilde r$ and $\tilde s$, respectively. The inner product of $\tilde r$ with
$\tilde s$ is denoted by $\mu$, that is, $\mu = \tilde r \cdot \tilde s$.
We shall assume $\mu$ to be positive, as otherwise there will be no common
label between the two sources. The labels on the nth records read from the
R and S sources are denoted by $L_R(n)$ and $L_S(n)$, respectively. The above
implies that $\{L_R(n)\}_{n\ge1}$ and $\{L_S(n)\}_{n\ge1}$ are sequences of
independent and identically distributed random variables with

$$\Pr(L_R(1) = i) = r_i \quad\text{and}\quad \Pr(L_S(1) = i) = s_i, \qquad i = 1, 2, \ldots.$$
Associated with the sequences $\{L_R(n)\}_{n\ge1}$ and $\{L_S(n)\}_{n\ge1}$ are
the discrete time vector counting processes $\{\tilde N_R(n)\}_{n\ge1}$ and
$\{\tilde N_S(n)\}_{n\ge1}$; the first is defined by

$$\tilde N_R(n) = (N_R(n,1), N_R(n,2), \ldots), \quad\text{with}\quad N_R(n,i) = \sum_{j=1}^{n} I_{\{L_R(j) = i\}}, \qquad i, n \ge 1,$$

and the second is defined analogously.

2.1. Reading policies. A reading policy is a zero-one valued stochastic
process

$$C(n) := \begin{cases} 1, & \text{if the nth selection is from R},\\ 0, & \text{if the nth selection is from S}, \end{cases} \qquad n = 1, 2, \ldots.$$

Associated with each reading policy are two counting processes
$\{R(n)\}_{n\ge1}$ and $\{S(n)\}_{n\ge1}$ defined by

$$R(n) := \sum_{j=1}^{n} C(j) \quad\text{and}\quad S(n) := n - R(n), \qquad n = 1, 2, \ldots.$$

These processes keep track of the number of records read from R and S,
respectively. We shall refer to $R(n)/n$ as the selection ratio. Also associated
with a reading policy is a nondecreasing process $\{M(n)\}_{n\ge1}$ which counts
the number of matches generated by the first n records:

$$M(n) = \tilde N_R(R(n)) \cdot \tilde N_S(S(n)), \qquad n = 1, 2, \ldots.$$

Observe that all of the processes $\{M(n)\}_{n\ge1}$, $\{R(n)\}_{n\ge1}$ and
$\{S(n)\}_{n\ge1}$ depend on the reading policy even though the notation does
not make it explicit.
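As an illustration of these definitions, the sketch below computes $M(n)$ along a path as the inner product of the two label-count vectors. The function name `matches` and the list-based encoding of the policy and the label sequences are assumptions made for the example.

```python
from collections import Counter


def matches(policy, labels_r, labels_s):
    """Return [M(1), ..., M(n)], where policy[j] is C(j + 1) (1 = read R, 0 = read S)."""
    count_r, count_s = Counter(), Counter()  # N_R(R(n), .) and N_S(S(n), .)
    read_r = read_s = 0  # R(n) and S(n)
    history = []
    for c in policy:
        if c == 1:
            count_r[labels_r[read_r]] += 1
            read_r += 1
        else:
            count_s[labels_s[read_s]] += 1
            read_s += 1
        # Inner product of the two count vectors.
        history.append(sum(k * count_s[label] for label, k in count_r.items()))
    return history
```

For example, `matches([1, 0, 1, 0, 1, 0], [1, 2, 1], [1, 1, 3])` returns `[0, 1, 1, 2, 4, 4]`, a nondecreasing sequence as stated above.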

The filtration $\{\mathcal F_n\}_{n\ge0}$ for a given reading policy is defined by

$$\mathcal F_n := \mathcal F_0 \vee \sigma(L_R(1), \ldots, L_R(R(n)); L_S(1), \ldots, L_S(S(n))), \qquad n = 1, 2, \ldots,$$

with $\mathcal F_0$ containing all the information needed for randomization and
independent of $\{L_R(n)\}_{n\ge1}$ and $\{L_S(n)\}_{n\ge1}$. All reading policies
are required to be predictable with respect to the above filtration, as
otherwise they would not be implementable.

Definition 2.1. An alternating policy is an $\mathcal F_0$ measurable reading
policy for which

(2.1) $R(2n) = n, \qquad n = 1, 2, \ldots.$

In words, an alternating policy is one which does not use any information
from the records, and under which at any step the numbers of records read
from the two sources are within one of each other. There exists an infinite
number of alternating policies. One of the simplest alternating policies is
defined by $C(n) = n \bmod 2$. In fact, in the arguments we tacitly assume for
convenience that we are working with this version. From the point of view
of implementation though, one may prefer the alternating policy given by
$C(n) = I_{\{n \bmod 4 < 2\}}$ as it, leaving apart the first record, reads two
records at a time from the chosen source.
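The two alternating policies just mentioned are easy to write down and to check against (2.1). This is a small sketch; the function names are chosen here for illustration only.

```python
def alternating_simple(n):
    """C(n) = n mod 2: read R on odd steps and S on even steps."""
    return n % 2


def alternating_paired(n):
    """C(n) = I{n mod 4 < 2}: after the first record, two reads per source visit."""
    return 1 if n % 4 < 2 else 0


def reads_from_r(policy, n):
    """R(n): number of the first n reads taken from R under the given policy."""
    return sum(policy(j) for j in range(1, n + 1))
```

The paired version produces the source sequence R, S, S, R, R, S, S, ..., so both versions satisfy $R(2n) = n$ while keeping the two sample sizes within one of each other at every step.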

Toward defining greedy policies, we observe that

$$E(M(n+1) - M(n) \mid \mathcal F_n) = E(N_S[S(n), L_R(R(n)+1)] \mid \mathcal F_n)\,C(n+1) + E(N_R[R(n), L_S(S(n)+1)] \mid \mathcal F_n)\,(1 - C(n+1)).$$

Hence, any reading policy $C(\cdot)$ maximizing the above conditional
expectation should satisfy, for $n \ge 1$,

$$C(n+1) = \begin{cases} 1, & \text{if } E(N_S[S(n), L_R(R(n)+1)] \mid \mathcal F_n) > E(N_R[R(n), L_S(S(n)+1)] \mid \mathcal F_n),\\ 0, & \text{if } E(N_S[S(n), L_R(R(n)+1)] \mid \mathcal F_n) < E(N_R[R(n), L_S(S(n)+1)] \mid \mathcal F_n), \end{cases} \tag{2.2}$$

with no requirement on epochs where

$$E(N_S[S(n), L_R(R(n)+1)] \mid \mathcal F_n) = E(N_R[R(n), L_S(S(n)+1)] \mid \mathcal F_n). \tag{2.3}$$

As $\{L_R(n)\}_{n\ge1}$ and $\{L_S(n)\}_{n\ge1}$ are sequences of i.i.d. random
variables, we have

$$E(N_S[S(n), L_R(R(n)+1)] \mid \mathcal F_n) = \tilde N_S(S(n)) \cdot \tilde r, \qquad n = 1, 2, \ldots,$$

and an analogous relation for $\tilde N_R(R(n)) \cdot \tilde s$.

Our analysis depends on the observation that $\tilde N_S(S(n)) \cdot \tilde r$ and
$\tilde N_R(R(n)) \cdot \tilde s$ are both partial sums of i.i.d. observations. To
make this explicit, we define

$$X_R(n) := s_{L_R(n)} \quad\text{and}\quad X_S(n) := r_{L_S(n)}, \qquad n = 1, 2, \ldots.$$

The two sequences $\{X_R(n)\}_{n\ge1}$ and $\{X_S(n)\}_{n\ge1}$ are sequences of
i.i.d. random variables with common mean $\mu$ and variances $\sigma_R^2$ and
$\sigma_S^2$, respectively. We shall denote their partial sums by $T_R[\cdot]$
and $T_S[\cdot]$, that is,

$$T_R[n] = \sum_{j=1}^{n} X_R(j) \quad\text{and}\quad T_S[n] = \sum_{j=1}^{n} X_S(j), \qquad n = 1, 2, \ldots.$$

This leads to the relations

$$\tilde N_S(S(n)) \cdot \tilde r = T_S[S(n)] \quad\text{and}\quad \tilde N_R(R(n)) \cdot \tilde s = T_R[R(n)], \qquad n = 1, 2, \ldots.$$
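The relations above say that an inner product of label counts with a probability vector equals a running sum of per-record probabilities. A minimal numerical check, with a hypothetical helper name and with the vector $\tilde r$ stored as a dictionary from labels to probabilities, might look as follows.

```python
from collections import Counter
from fractions import Fraction


def partial_sum_identity(r, labels_s, m):
    """Return (N_S(m) . r, T_S[m]) for the first m labels read from S."""
    counts = Counter(labels_s[:m])  # N_S(m, i) for each label i seen
    inner = sum(k * r.get(label, Fraction(0)) for label, k in counts.items())
    # T_S[m] = sum of X_S(j) = r_{L_S(j)} over the first m records from S.
    t_s = sum((r.get(label, Fraction(0)) for label in labels_s[:m]), Fraction(0))
    return inner, t_s
```

Labels that never occur in R contribute zero on both sides, which is why `r.get(label, Fraction(0))` is used.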
Combining the above with (2.2) leads to the following definition. 
Definition 2.2. A reading policy $C(\cdot)$ is called a greedy policy if it
satisfies

$$C(n+1) = \begin{cases} 1, & \text{if } T_S[S(n)] > T_R[R(n)],\\ 0, & \text{if } T_S[S(n)] < T_R[R(n)], \end{cases} \qquad n = 1, 2, \ldots. \tag{2.4}$$

Henceforth, all quantities with a subscript of G will pertain to a greedy
policy and those with a subscript of A to an alternating policy. An important
consequence of the definition of a greedy policy is that

$$|T_R[R_G(n)] - T_S[S_G(n)]| \le \gamma, \qquad n = 1, 2, \ldots, \tag{2.5}$$

where $\gamma := \max_{1\le i<\infty} r_i \vee \max_{1\le i<\infty} s_i$.

The case $\sigma_R + \sigma_S = 0$ is equivalent to having $\tilde r$ and
$\tilde s$ as uniform distributions with identical finite supports. And it is
easily checked that identical uniform distributions make the set of all greedy
policies coincide with the set of all alternating policies. Hence, in the
following we will assume $\sigma_R + \sigma_S > 0$.

2.2. Optimality of the greedy. Here we show using dynamic programming
that the greedy policy maximizes the expected number of matches at all
epochs in the case of infinite populations. A key observation to showing this
is that the incremental gain of matches from the (n + 1)st record can be
written in terms of $T_R[R(n)]$ and $T_S[S(n)]$ as

$$E(M(n+1) - M(n) \mid \mathcal F_n) = T_S[S(n)]\,C(n+1) + T_R[R(n)]\,(1 - C(n+1)).$$

This suggests a rather compact Markov Decision Problem (MDP) formulation:
at the nth epoch, the state is defined as $(T_R[R(n)], T_S[S(n)])$ and the
action of choosing the next record from the R source results in a reward of
$T_S[S(n)]$ with $(T_R[R(n)] + X_R(R(n)+1), T_S[S(n)])$ as the new state (when
the next record is chosen from S the reward and the new state are
analogously defined). That this compact representation fails in the case of
finite population(s) is easily demonstrated; see [16].

Abstractly, following the system of specifying an MDP as given in [11],
consider the MDP with decision epochs $\{1, \ldots, N\}$ for some $N \ge 1$, state
space $[0,\infty) \times [0,\infty)$, action set (invariant to the current state)
$\{0, 1\}$, with time homogeneous expected rewards

$$r(\xi, a) = \xi_{1+a} \quad\text{for } a = 0, 1;\ \xi = (\xi_1, \xi_2) \in [0,\infty) \times [0,\infty)$$

and time homogeneous transition probabilities

$$p(\xi' \mid \xi; a) = \begin{cases} p_2(\xi_2' - \xi_2), & a = 0 \text{ and } \xi_1' = \xi_1,\\ p_1(\xi_1' - \xi_1), & a = 1 \text{ and } \xi_2' = \xi_2,\\ 0, & \text{otherwise}, \end{cases}$$

where $p_1(\cdot)$ and $p_2(\cdot)$ are probability densities (with respect to some
$\sigma$-finite measure $\lambda$) on $[0,\infty)$ with a common mean, say,
$\theta$. In terms of our original problem, action 1 (resp. 0) corresponds to
picking the next record from R (resp. S), $p_1(\cdot)$ [resp., $p_2(\cdot)$]
corresponds to the mass function of $X_R$ (resp. $X_S$) and the reward is the
expected increment in the number of matches from the next record.

The above MDP, while reminiscent of a two-armed bandit (see, e.g., [11] 
and [13]), is not quite so. Considering the renewal processes, with inter- 
arrival distributions pi(-) and P2(-), as the states of two projects, the ex- 
pected reward generated by choosing a project is equal to the state of the 
other. This dependence of the reward on the state of the other (idle) project 
fails one of the requirements of the bandit problem; see [11]. Nevertheless, it
fits the formalism of the (two machine) tax problem of [18] (ongoing bandits
of [4]), where the reward structure is in a sense the reverse of the bandit
problem.

In the tax problem, at any epoch, one of K machines can be operated 
with idle machines generating a cost (the tax). The goal is to schedule the 
machines in order to minimize, for example, the expected total discounted 
stream of costs. Interestingly, from the point of view of a search for the 
optimal strategy, the tax problem (with infinite horizon and discounted re- 
wards) and the bandit problem are equivalent; see [18]. As shown in [1], 
such an equivalence holds even while allowing all machines, active or inac- 
tive, to generate either a reward or a cost (negative reward) with the goal 
of maximizing the total discounted stream of rewards. Also, our MDP can 
be seen to be a particular case of the generalized bandit problem of [10]. 
While [1, 10] and [18] look at an infinite horizon discounted reward crite- 
rion, our interest is in the finite horizon analysis of our MDP. Below we show 
by a simple inductive argument that the greedy (myopic) policy is optimal 
in the finite horizon case under the total expected reward criterion. For a
more involved proof using the interchange argument, see [16]. Also, it is not
hard to construct a simple qualitative argument along the lines of the proof of
the Gittins index theorem of [19].

It should be no surprise, given the time homogeneity and two point action
set, that optimal deterministic Markov policies exist; see, for example,
Theorem 4.4.2 of [11]. Moreover, it is easily argued from the reward structure
that an optimal deterministic Markov policy which is a function of
$\xi_1 - \xi_2$ exists. Below we additionally show that this policy is given by
$I_{(-\infty,0)}(\xi_1 - \xi_2)$; the greedy policy.

Theorem 2.1. The greedy policy is optimal under the total expected 
reward criterion for any finite horizon. 

Remark 2.1. That the greedy policy maximizes the expected number 
of matches at all epochs implies that it also maximizes the total expected 
discounted incremental matches for all nonincreasing discount sequences. 
Remark 2.2. Note that the above theorem also implies that the greedy
policy is optimal when the number of matches is replaced by the total score

$$M(n) = \sum_{i=1}^{R(n)} \sum_{j=1}^{S(n)} \phi(L_R(i), L_S(j)),$$

where $\phi(\cdot,\cdot)$ is nonnegative. This implies that optimality of the
greedy extends beyond equijoins. Moreover, the proof allows $L_R$ and $L_S$ to
be random elements on any space.

Remark 2.3. The above theorem does not extend directly beyond two
sources. This is reminiscent of the bandit problem with different discount
factors for each bandit: a Gittins index exists in the case of the two armed
bandit, but not beyond. For details, we refer to [4].

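Proof of Theorem 2.1. The proof is by induction. Let $V_n(\xi)$ denote
the maximum total expected reward for the n-epoch (n more records to
pick) problem at state $\xi$. Clearly, $V_1(\xi) = \max(\xi_1, \xi_2)$, which is
attained by the greedy policy. Assume without loss of generality that
$\xi_1 \ge \xi_2$. Now, since

$$\xi_1 + \int V_1((\xi_1, \xi_2 + \zeta))\,p_2(\zeta)\,d\lambda(\zeta) \ge \xi_1 + \int (\xi_2 + \zeta)\,p_2(\zeta)\,d\lambda(\zeta) = \xi_1 + \xi_2 + \theta = \xi_2 + \int V_1((\xi_1 + \zeta, \xi_2))\,p_1(\zeta)\,d\lambda(\zeta),$$

we have that the greedy policy is optimal for the 2-epoch problem too. Now
assume that the greedy policy attains $V_i(\xi)$ for $i = 1, 2, \ldots, (n-1)$ for
all $\xi \in [0,\infty) \times [0,\infty)$. That $V_n(\xi)$ is also attained by a
greedy policy follows from

$$\xi_1 + \int V_{n-1}((\xi_1, \xi_2 + \zeta))\,p_2(\zeta)\,d\lambda(\zeta) \ge \xi_1 + \int \Big[\xi_2 + \zeta + \int V_{n-2}((\xi_1 + \eta, \xi_2 + \zeta))\,p_1(\eta)\,d\lambda(\eta)\Big] p_2(\zeta)\,d\lambda(\zeta)$$

$$= \xi_2 + \int \Big[\xi_1 + \eta + \int V_{n-2}((\xi_1 + \eta, \xi_2 + \zeta))\,p_2(\zeta)\,d\lambda(\zeta)\Big] p_1(\eta)\,d\lambda(\eta) = \xi_2 + \int V_{n-1}((\xi_1 + \eta, \xi_2))\,p_1(\eta)\,d\lambda(\eta).$$

Here the inequality holds because $V_{n-1}$ is at least the value of taking
action 1 first and proceeding optimally, the middle equality uses the common
mean $\theta$ of $p_1$ and $p_2$, and the last equality uses the induction
hypothesis at the state $(\xi_1 + \eta, \xi_2)$, where $\xi_1 + \eta \ge \xi_2$.
The left-hand side is the value of taking action 0 first and the right-hand
side that of taking action 1 first.

Hence, the proof. □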
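Theorem 2.1 can be checked numerically on a small instance by backward induction. The sketch below is an illustration, not the proof: it uses the coin example as the MDP, with increments of $\xi_1$ equal to 0 or 1 with probability 1/2 each and increments of $\xi_2$ equal to 1/2 (common mean 1/2), and compares the optimal finite-horizon value with the value of the greedy rule $I_{(-\infty,0)}(\xi_1 - \xi_2)$. The names `P1`, `P2`, `optimal_value` and `greedy_value` are assumptions made for the example.

```python
from fractions import Fraction
from functools import lru_cache

# (increment, probability) pairs; both distributions have mean 1/2.
P1 = ((Fraction(0), Fraction(1, 2)), (Fraction(1), Fraction(1, 2)))  # action 1
P2 = ((Fraction(1, 2), Fraction(1)),)  # action 0


def step_value(x1, x2, a, cont):
    """Reward xi_{1+a} plus the expected continuation value after action a."""
    if a == 1:
        return x2 + sum(p * cont(x1 + z, x2) for z, p in P1)
    return x1 + sum(p * cont(x1, x2 + z) for z, p in P2)


@lru_cache(maxsize=None)
def optimal_value(n, x1, x2):
    """V_n(xi): best total expected reward with n epochs to go."""
    if n == 0:
        return Fraction(0)
    cont = lambda y1, y2: optimal_value(n - 1, y1, y2)
    return max(step_value(x1, x2, a, cont) for a in (0, 1))


@lru_cache(maxsize=None)
def greedy_value(n, x1, x2):
    """Total expected reward of the greedy rule: action 1 iff xi_1 < xi_2."""
    if n == 0:
        return Fraction(0)
    a = 1 if x1 < x2 else 0
    cont = lambda y1, y2: greedy_value(n - 1, y1, y2)
    return step_value(x1, x2, a, cont)
```

Starting from the state (0, 0), the two value functions agree for every horizon tried, as the theorem asserts; for horizon 3 both equal 5/4, the value of $E(M_G(3))$ in the coin example.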
3. Basic weak and strong limit theorems. The heuristics in the introduction
suggest (and the results of this section confirm) that for a policy to be
competitive, its selection ratio must converge to 1/2. However, in some
applications the observer may not have control of the sampling order, or may
find it cost effective to sample unevenly from the two sources. Thus, the
case where $R(n)/n \to \alpha \in (0,1)$ is of interest. In this section we show
that the number of matches, when suitably centered and scaled, can be strongly
approximated by a standard Wiener process. From this approximation it is
easy to obtain a CLT and LIL for the number of matches.



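Theorem 3.1. Consider a reading policy that satisfies
$R(n)/n \xrightarrow{a.s.} \alpha \in (0,1)$ and the associated process
$Z(\cdot)$ defined by

$$Z(t) := \begin{cases} 0, & \text{for } 0 \le t < V_1,\\ \dfrac{M(n)}{n} - n\alpha(1-\alpha)\mu, & \text{for } V_n \le t < V_{n+1} \text{ and } n \ge 1, \end{cases}$$

where $V_n := (1-\alpha)^2\sigma_R^2 R(n) + \alpha^2\sigma_S^2 S(n)$ for $n \ge 1$.
If for some $\beta > 0$,

$$\sqrt{\frac{n}{[\log_2(n)]^{(1-\beta)}}}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0, \tag{3.1}$$

then a probability space can be constructed which supports a standard Wiener
process W and a process $Z'$ such that

$$\{Z(t) : t \ge 0\} \stackrel{d}{=} \{Z'(t) : t \ge 0\} \quad\text{and}\quad \frac{|Z'(t) - W(t)|}{\sqrt{t[\log_2(t)]^{(1-\beta)}}} \xrightarrow{a.s.} 0.$$

In the case where $\alpha = 1/2$, condition (3.1) may be replaced by the weaker
condition

$$\left(\frac{n}{[\log_2(n)]^{(1-\beta)}}\right)^{1/4}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0. \tag{3.2}$$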

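Corollary 3.1. For any reading policy satisfying

$$\sqrt{n}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0 \quad\text{with } \alpha \in (0,1), \tag{3.3}$$

we have

$$\sqrt{n}\left(\frac{M(n)}{\alpha(1-\alpha)n^2} - \mu\right) \xrightarrow{d} N\left(0, \frac{(1-\alpha)\sigma_R^2 + \alpha\sigma_S^2}{\alpha(1-\alpha)}\right). \tag{3.4}$$

In the case where $\alpha = 1/2$, condition (3.3) may be replaced by

$$n^{1/4}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0. \tag{3.5}$$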
Remark 3.1. In Corollary 3.1 almost sure convergence in (3.3) and
(3.5) may be replaced by convergence in probability. This follows from the
proof of Theorem 3.1 and the martingale central limit theorem (see, e.g.,
Theorem 7.4 of [3]).

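Corollary 3.2. For any reading policy satisfying

$$\sqrt{\frac{n}{\log_2 n}}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0, \tag{3.6}$$

we have, with probability one,

$$\limsup_{n\to\infty} \frac{M(n) - n^2\alpha(1-\alpha)\mu}{\sqrt{2\kappa n^3 \log_2 n}} = 1 = -\liminf_{n\to\infty} \frac{M(n) - n^2\alpha(1-\alpha)\mu}{\sqrt{2\kappa n^3 \log_2 n}}, \tag{3.7}$$

where $\kappa := \alpha(1-\alpha)((1-\alpha)\sigma_R^2 + \alpha\sigma_S^2)$. When
$\alpha = 1/2$, condition (3.6) may be replaced by

$$\left(\frac{n}{\log_2 n}\right)^{1/4}\left(\frac{R(n)}{n} - \alpha\right) \xrightarrow{a.s.} 0. \tag{3.8}$$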



From the above corollaries we see that the central limit and iterated log- 
arithm behavior of the number of matches are invariant among the class of 
policies whose selection ratios converge to a sufficiently fast. Included in this 
class are the alternating policies and (it will be shown) the greedy policies, 
both with a = 1/2. Thus, these policies obey the same CLT and LIL, the 
optimality of the latter policy notwithstanding. 

We observe that conditions (3.3) and (3.6) fail under Bernoulli sampling,
where the source is determined by independent tosses of an $\alpha$-coin,
$\alpha \ne 1/2$. Before discussing this case further, we give a simple example
showing that these conditions can hold for a nondeterministic policy which
imposes a restorative pressure to keep its selection ratio close to $\alpha$.
Consider a reading policy with $R(1) = 1$ and which for $n > 1$ chooses source R
with probability $\alpha_1 \in (0, \alpha)$ [resp., $\alpha_2 \in (\alpha, 1)$]
when $R(n-1) > \alpha(n-1)$ [resp., $R(n-1) \le \alpha(n-1)$]. For such a policy,
we have $-U \stackrel{st}{\le} R(n) - \alpha n \stackrel{st}{\le} V$, where U
and V are defined by

$$U := \min\Big\{k : \sum_{j=1}^{k} X_{2j} \ge \alpha k + 1\Big\} \quad\text{and}\quad V := \min\Big\{k : \sum_{j=1}^{k} X_{1j} \le \alpha k - 1\Big\},$$

with $\{X_{ij} : j \ge 1\}$ an i.i.d. Ber($\alpha_i$) sequence for $i = 1, 2$. By
Bernstein's inequality (e.g., see [15]), U and V have exponential tails from
which conditions (3.3) and (3.6) follow.

We will see in the proof of Theorem 3.1 below that

$$\sqrt{n}\left(\frac{M(n)}{\alpha(1-\alpha)n^2} - \mu\right) \quad\text{and}\quad \sqrt{n}\left(\frac{(1-\alpha)T_R[R(n)] + \alpha T_S[S(n)]}{\alpha(1-\alpha)n} - 2\mu\right)$$

share the same weak limit quite generally and, in particular, under Bernoulli
sampling. Moreover, note that under the Bernoulli sampling the expression
$(1-\alpha)T_R[R(n)] + \alpha T_S[S(n)]$ is the nth partial sum of a sequence of
independent variables, all with the distribution of
$(1-\alpha)R(1)X_R(1) + \alpha(1 - R(1))X_S(1)$. Thus, by the ordinary CLT for
i.i.d. sequences, we obtain the CLT (and, by a similar argument, the LIL) for
$M(n)$ with the asymptotic variance given by

$$\frac{\mu^2(1-2\alpha)^2}{\alpha(1-\alpha)} + \frac{(1-\alpha)\sigma_R^2 + \alpha\sigma_S^2}{\alpha(1-\alpha)}. \tag{3.9}$$

More generally, when $\{R(n)\}_{n\ge1}$ is independent of the labels, the CLT
holds under $\sqrt{n}(R(n)/n - \alpha) \xrightarrow{d} N(0, \sigma^2)$, with the
asymptotic variance given by

$$\left(\frac{\mu(1-2\alpha)}{\alpha(1-\alpha)}\right)^2 \sigma^2 + \frac{(1-\alpha)\sigma_R^2 + \alpha\sigma_S^2}{\alpha(1-\alpha)}.$$

The argument uses Kolmogorov's maximal inequality to show that $Y_n$ in
(3.12) is sufficiently close to the partial sum of the first $n\alpha$
$X_R(\cdot)$'s and $n(1-\alpha)$ $X_S(\cdot)$'s. The CLT follows from the
independence of this partial sum and $R(n)$.

It is interesting to note the more stringent requirement in the above
corollaries on the rate of convergence of the selection ratio when the limit is
other than 1/2. This is to account for the phenomenon that while the policy
which uses Bernoulli sampling with $\alpha = 1/2$ obeys the same CLT as an
alternating policy (and, as we shall see, a greedy policy), the policy which
uses Bernoulli sampling with $\alpha \ne 1/2$ has a higher asymptotic variance
than a policy for which $R(n) = \lfloor n\alpha \rfloor$ [cf. the asymptotic
variance in (3.4) to the expression in (3.9) for the cases $\alpha = 1/2$ and
$\alpha \ne 1/2$].
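The comparison between the two asymptotic variances is easy to carry out numerically. The sketch below computes $\mu$, $\sigma_R^2$ and $\sigma_S^2$ from the label distributions and then evaluates the variance in (3.4) and the Bernoulli-sampling variance, which adds the term $\mu^2(1-2\alpha)^2/(\alpha(1-\alpha))$. The function names are assumptions, and `r` and `s` are taken to be finite lists over a common label set.

```python
from fractions import Fraction


def moments(r, s):
    """Return (mu, sigma_R^2, sigma_S^2) for label distributions r and s."""
    mu = sum(ri * si for ri, si in zip(r, s))
    var_r = sum(ri * si ** 2 for ri, si in zip(r, s)) - mu ** 2  # Var(s_{L_R})
    var_s = sum(si * ri ** 2 for ri, si in zip(r, s)) - mu ** 2  # Var(r_{L_S})
    return mu, var_r, var_s


def variance_fast_ratio(r, s, alpha):
    """Asymptotic variance in (3.4), for selection ratios converging quickly."""
    _, var_r, var_s = moments(r, s)
    return ((1 - alpha) * var_r + alpha * var_s) / (alpha * (1 - alpha))


def variance_bernoulli(r, s, alpha):
    """Asymptotic variance under Bernoulli sampling with an alpha-coin."""
    mu, _, _ = moments(r, s)
    extra = mu ** 2 * (1 - 2 * alpha) ** 2 / (alpha * (1 - alpha))
    return extra + variance_fast_ratio(r, s, alpha)
```

For the coin example, with r = (1/2, 1/2) and s = (1, 0), the two variances coincide at $\alpha = 1/2$ (both equal 1/2) and differ at $\alpha = 1/4$ (1 versus 4/3).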

We now state the CLT and LIL for the selection ratio of a greedy policy, 
the latter result yielding both the CLT and LIL for the number of matches 
via an application of Corollary 3.1 and Corollary 3.2, respectively. 

Theorem 3.2. For the greedy reading policy $C_G$, we have

$$\frac{R_G(n) - n/2}{\sqrt{n}} \xrightarrow{d} N(0, \sigma_G^2), \qquad \sigma_G := \sqrt{\frac{\sigma_R^2 + \sigma_S^2}{8\mu^2}}. \tag{3.10}$$

Theorem 3.3. For the greedy reading policy $C_G$, we have

$$\limsup_{n\to\infty} \frac{R_G(n) - n/2}{\sqrt{2\sigma_G^2\, n \log_2 n}} = 1 = -\liminf_{n\to\infty} \frac{R_G(n) - n/2}{\sqrt{2\sigma_G^2\, n \log_2 n}} \quad\text{w.p. 1.} \tag{3.11}$$

Corollary 3.3. For both greedy and alternating policies, we have

$$\sqrt{n}\left(\frac{M(n)}{(n/2)^2} - \mu\right) \xrightarrow{d} N(0, 2(\sigma_R^2 + \sigma_S^2)).$$

Moreover, for these policies, we also have (3.7) with $\alpha = 1/2$.
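A rough Monte Carlo check of the scale in Theorem 3.2 is sketched below for the coin example, where $\mu = 1/2$, $\sigma_R^2 = 1/4$ and $\sigma_S^2 = 0$, so that $\sigma_G^2 = 1/8$. This is only an illustration under stated assumptions: ties are broken by a fair coin, floating point sums are used (they are exact for this example), and the function name is hypothetical.

```python
import random
import statistics


def greedy_reads_from_r(r, s, n, rng):
    """Simulate n reads of a greedy policy and return R_G(n)."""
    labels = range(len(r))
    t_r = t_s = 0.0
    reads_r = 0
    for _ in range(n):
        if t_s > t_r or (t_s == t_r and rng.random() < 0.5):
            t_r += s[rng.choices(labels, weights=r)[0]]  # read R, add X_R
            reads_r += 1
        else:
            t_s += r[rng.choices(labels, weights=s)[0]]  # read S, add X_S
    return reads_r


def scaled_variance(r, s, n, reps, rng):
    """Sample variance of (R_G(n) - n/2) / sqrt(n) over independent runs."""
    samples = [(greedy_reads_from_r(r, s, n, rng) - n / 2) / n ** 0.5
               for _ in range(reps)]
    return statistics.pvariance(samples)
```

With `r = [0.5, 0.5]`, `s = [1.0, 0.0]`, `n = 400` and a thousand replications, the returned value should be in the neighborhood of 1/8; the agreement is only approximate at finite n.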

While both of the above theorems are of independent interest, the former 
is of interest also for the similarity of its derivation to that of the weak limit 
of a sequence of stopping times needed in the next section, and the latter 
for its application to the number of matches. 

3.1. Proofs of Theorems 3.1-3.3. 

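Proof of Theorem 3.1. First, we observe that

$$\frac{M(n)}{n} - \mu_\alpha n = Y_n + \mu[1 - 2\alpha](R(n) - \alpha n) + \alpha(1-\alpha)n\left(\frac{\tilde N_R(R(n))}{\alpha n} - \tilde r\right) \cdot \left(\frac{\tilde N_S(S(n))}{(1-\alpha)n} - \tilde s\right), \tag{3.12}$$

where $\mu_\alpha := \alpha(1-\alpha)\mu$ and for $n \ge 1$,

$$Y_n := (1-\alpha)T_R[R(n)] + \alpha T_S[S(n)] - [\alpha n + (1 - 2\alpha)R(n)]\mu. \tag{3.13}$$

Second, we show that

$$\frac{M(n)/n - \mu_\alpha n - Y_n}{\sqrt{n[\log_2(n)]^{(1-\beta)}}} \xrightarrow{a.s.} 0. \tag{3.14}$$

Let $a_n := \sqrt{n/[\log_2(n)]^{(1-\beta)}}$ for $n \ge 1$. Toward showing (3.14),
we observe that when $\alpha = 1/2$, the term
$a_n\mu[1 - 2\alpha](R(n)/n - \alpha)$ is 0 and otherwise it converges to 0 in
the almost sure sense by (3.1). We note that this is the only reason for
requiring the stringent condition (3.1). Now the proof of (3.14) is
completed if we can show that (3.2) implies that

$$a_n\left(\frac{\tilde N_R(R(n))}{\alpha n} - \tilde r\right) \cdot \left(\frac{\tilde N_S(S(n))}{(1-\alpha)n} - \tilde s\right) \xrightarrow{a.s.} 0.$$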
By the Cauchy-Schwarz inequality, we have

$$\left|\left(\frac{\tilde N_R(R(n))}{\alpha n} - \tilde r\right) \cdot \left(\frac{\tilde N_S(S(n))}{(1-\alpha)n} - \tilde s\right)\right| \le \sqrt{\sum_{i=1}^{\infty}\left(\frac{N_R[R(n),i]}{\alpha n} - r_i\right)^2}\ \sqrt{\sum_{i=1}^{\infty}\left(\frac{N_S[S(n),i]}{(1-\alpha)n} - s_i\right)^2}. \tag{3.15}$$

By symmetry, it suffices to show that the first term on the right-hand side
of (3.15), when squared and multiplied by $a_n$, converges to zero in the
almost sure sense:

$$a_n\sum_{i=1}^{\infty}\left(\frac{N_R[R(n),i]}{\alpha n} - r_i\right)^2 \le 2\,\underbrace{\frac{a_n \log_2 R(n)}{R(n)}\left(\frac{R(n)}{\alpha n}\right)^2}_{\xrightarrow{a.s.}\,0}\ \underbrace{\frac{R(n)}{\log_2 R(n)}\sum_{i=1}^{\infty}\left(\frac{N_R[R(n),i]}{R(n)} - r_i\right)^2}_{O(1)\ \text{a.s. by Lemma A.1}} + 2\,\underbrace{a_n\left(\frac{R(n)}{\alpha n} - 1\right)^2}_{\xrightarrow{a.s.}\,0\ \text{by (3.2)}}\ \sum_{i=1}^{\infty} r_i^2 \xrightarrow{a.s.} 0.$$

Third, we show that $\{Y_n\}_{n\ge1}$ is a $\{\mathcal F_n\}_{n\ge1}$ martingale
with bounded increments. Toward this, we note that, for $n \ge 2$,

$$D_n := Y_n - Y_{n-1} = \begin{cases} (1-\alpha)(X_R(R(n)) - \mu), & \text{if } C(n) = 1,\\ \alpha(X_S(S(n)) - \mu), & \text{if } C(n) = 0, \end{cases}$$

with $D_1 := Y_1$.

Now as $C(n)$ is $\mathcal F_{n-1}$ measurable and both $X_R(R(n-1)+1)$ as well
as $X_S(S(n-1)+1)$ are independent of $\mathcal F_{n-1}$, we have
$E(D_n \mid \mathcal F_{n-1}) = 0$. Moreover, as $D_n$ is bounded and $Y_n$ is
$\mathcal F_n$ measurable, we have the above description of $\{Y_n\}_{n\ge1}$.
Further, observe that
$E(D_n^2 \mid \mathcal F_{n-1}) = (1-\alpha)^2\sigma_R^2 C(n) + \alpha^2\sigma_S^2(1 - C(n))$,
which implies that

$$\frac{1}{n}\sum_{k=1}^{n} E(D_k^2 \mid \mathcal F_{k-1}) = (1-\alpha)^2\sigma_R^2\frac{R(n)}{n} + \alpha^2\sigma_S^2\frac{S(n)}{n} \xrightarrow{a.s.} \alpha(1-\alpha)((1-\alpha)\sigma_R^2 + \alpha\sigma_S^2).$$

Fourth, we observe that the above description of $\{Y_n\}_{n\ge 1}$ along with
Theorem 3.2 of [8] implies that a probability space can be constructed (a suitably
augmented version of the one in [8]) which supports a Wiener process $W$
and a sequence $\{(R'(n), L'_R(n), L'_S(n))\}_{n\ge 1}$ satisfying the following:

(i) $\{(R'(n), L'_R(n), L'_S(n))\}_{n\ge 1} \stackrel{d}{=} \{(R(n), L_R(n), L_S(n))\}_{n\ge 1}$,

(ii) If $Y'_n$ is the same function of $\{(R'(n), L'_R(n), L'_S(n))\}_{n\ge 1}$ as $Y_n$ is of
$\{(R(n), L_R(n), L_S(n))\}_{n\ge 1}$ and

\[
Y'(t) :=
\begin{cases}
0, & \text{for } 0 \le t < V'_1,\\
Y'_n, & \text{for } V'_n \le t < V'_{n+1} \text{ and } n \ge 1,
\end{cases}
\]

where $V'_n := (1-\alpha)^2\sigma_R^2 R'(n) + \alpha^2\sigma_S^2 S'(n)$ for $n \ge 1$, then

\[
\frac{|Y'(t) - W(t)|}{\sqrt{t}\,[\log_2(t)]^{(1-\beta)}} \xrightarrow{\text{a.s.}} 0.
\tag{3.16}
\]

To complete the proof, let $\{M'(n)\}_{n\ge 1}$ be the sequence defined exactly
as $\{M(n)\}_{n\ge 1}$ but using the sequence $\{(R'(n), L'_R(n), L'_S(n))\}_{n\ge 1}$ instead
of $\{(R(n), L_R(n), L_S(n))\}_{n\ge 1}$. Also, let $Z'$ be the process defined like $Z$ but
using $\{(M'(n), V'(n))\}_{n\ge 1}$ instead of $\{(M(n), V(n))\}_{n\ge 1}$. Then (3.14) and
(3.16) together imply

\[
\frac{|Z'(t) - W(t)|}{\sqrt{t}\,[\log_2(t)]^{(1-\beta)}} \xrightarrow{\text{a.s.}} 0.
\]

Hence, the proof. □

To prove Theorems 3.2 and 3.3, we utilize the following lemma. This 
lemma provides a tight connection with partial sums of i.i.d. variables that 
is key in the proofs of the two theorems. 

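Lemma 3.1. In the case of the greedy algorithm we have, for $0 \le x \le n$,

\[
\Gamma_R[\lceil x\rceil] < \Gamma_S[n - \lceil x\rceil] - \gamma
\;\Longrightarrow\; R_G(n) > x
\;\Longrightarrow\; \Gamma_R[\lfloor x\rfloor] \le \Gamma_S[n - \lfloor x\rfloor] + \gamma.
\]

Proof. Since $\Gamma_S[\cdot]$ and $\Gamma_R[\cdot]$ are nondecreasing, for $0 \le x \le n$, we have,
by (2.5),

\[
R_G(n) > x \;\Longrightarrow\; \Gamma_R[\lfloor x\rfloor] \le \Gamma_R[R_G(n)] \le \Gamma_S[S_G(n)] + \gamma \le \Gamma_S[n - \lfloor x\rfloor] + \gamma.
\]

The other half follows by observing that, for $0 \le x \le n$,

\[
R_G(n) \le x \;\Longrightarrow\; \Gamma_R[\lceil x\rceil] \ge \Gamma_R[R_G(n)] \ge \Gamma_S[S_G(n)] - \gamma \ge \Gamma_S[n - \lceil x\rceil] - \gamma. \qquad □
\]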

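Proof of Theorem 3.2. First, we argue that

\[
Z(n) := \frac{\Gamma_R[k_n] - \Gamma_S[n - k_n]}{\sqrt{n/2}}
\xrightarrow{d} N\bigl(2^{3/2}\mu x,\; \sigma_R^2 + \sigma_S^2\bigr), \quad \text{as } n \to \infty,
\tag{3.17}
\]

for any sequence $\{k_n\}_{n\ge 1}$ satisfying

\[
\lim_{n\to\infty} \frac{k_n - n/2}{\sqrt{n}} = x.
\]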



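To this end, note that

\[
Z(n) = a_n\,\underbrace{\frac{\sum_{j=1}^{k_n} X_R(j) - k_n\mu}{\sqrt{k_n}}}_{\xrightarrow{d}\, N(0,\sigma_R^2)}
\;-\; b_n\,\underbrace{\frac{\sum_{j=1}^{n-k_n} X_S(j) - (n-k_n)\mu}{\sqrt{n - k_n}}}_{\xrightarrow{d}\, N(0,\sigma_S^2)}
\;+\; c_n\, 2^{3/2}\mu x,
\tag{3.18}
\]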

where the three sequences $\{a_n\}_{n\ge 1}$, $\{b_n\}_{n\ge 1}$ and $\{c_n\}_{n\ge 1}$ all converge to 1.
The above observed weak limits are due to the ordinary central limit theorem.
By independence of the first two terms in (3.18) and Slutsky's theorem,
we have (3.17). Now defining $Z_*$ and $Z^*$ as $Z$ but with the sequence $\{k_n\}_{n\ge 1}$
taken as $\{\lceil n/2 + x\sqrt{n}\rceil\}_{n\ge 1}$ and $\{\lfloor n/2 + x\sqrt{n}\rfloor\}_{n\ge 1}$, respectively, we have,
as $n \to \infty$,

\[
Z_*(n) \xrightarrow{d} N\bigl(2^{3/2}\mu x,\; \sigma_R^2 + \sigma_S^2\bigr)
\quad\text{and}\quad
Z^*(n) \xrightarrow{d} N\bigl(2^{3/2}\mu x,\; \sigma_R^2 + \sigma_S^2\bigr).
\tag{3.19}
\]
By Lemma 3.1, we have for large $n$

\[
\Pr\left( Z_*(n) < -\frac{\gamma}{\sqrt{n/2}} \right)
\le \Pr\left( \frac{R_G(n) - n/2}{\sqrt{n}} > x \right)
\le \Pr\left( Z^*(n) \le \frac{\gamma}{\sqrt{n/2}} \right),
\tag{3.20}
\]

which combined with (3.19) completes the proof. □

Proof of Theorem 3.3. Let $\{k_n\}_{n\ge 1}$ be a sequence of nonnegative
integers and $\{a_n\}_{n\ge 1}$ a sequence of reals such that, for $n \ge 1$,

\[
\frac{k_n - n/2}{\sqrt{2\sigma^2_{R_G}\, n\log_2 n}} \xrightarrow{n\to\infty} c \ge 0
\quad\text{and}\quad
a_n = \frac{1}{\sqrt{2(\sigma_R^2 + \sigma_S^2)(n - k_n)\log_2(n - k_n)}}.
\tag{3.21}
\]

For such a sequence $\{k_n\}_{n\ge 1}$,

\[
a_n(\Gamma_R[k_n] - \Gamma_S[n - k_n])
= \underbrace{a_n(\Gamma_R[n - k_n] - \Gamma_S[n - k_n])}_{\liminf_{n\to\infty}\, =\, -1}
+ \underbrace{a_n(\Gamma_R[k_n] - \Gamma_R[n - k_n] - (2k_n - n)\mu)}_{\to\, 0}
+ \underbrace{a_n(2k_n - n)\mu}_{\to\, c},
\]

where the first limit infimum is due to the standard law of iterated logarithm,
the second limit is due to Theorem 5.1 of [6] on lag sums and the third limit
is a consequence of (3.21). Hence, for a sequence $\{k_n\}_{n\ge 1}$ satisfying (3.21),
we have

\[
\liminf_{n\to\infty}\, a_n(\Gamma_R[k_n] - \Gamma_S[n - k_n]) = c - 1.
\tag{3.22}
\]

Using (3.22) and Lemma 3.1, with $k_n = \lceil n/2 + (1-\varepsilon)\sqrt{2\sigma^2_{R_G}\, n\log_2 n}\,\rceil$, we
have

\[
\frac{R_G(n) - n/2}{\sqrt{2\sigma^2_{R_G}\, n\log_2 n}} > 1 - \varepsilon \quad \text{infinitely often (i.o.) a.s.} \quad \forall \varepsilon > 0.
\tag{3.23}
\]

Now similarly, working instead with $k_n = \lfloor n/2 + (1+\varepsilon)\sqrt{2\sigma^2_{R_G}\, n\log_2 n}\,\rfloor$, we
have

\[
\frac{R_G(n) - n/2}{\sqrt{2\sigma^2_{R_G}\, n\log_2 n}} \ge 1 + \varepsilon \quad \text{only finitely often a.s.} \quad \forall \varepsilon > 0.
\tag{3.24}
\]

Statements (3.23) and (3.24) are equivalent to the first statement in (3.11).
A similar argument leads to the other. Hence, the proof. □

4. Comparison of greedy and the alternating. The results of the last
section, which say that the weak limit and the law of the iterated logarithm
for $M_G(\cdot)$ and $M_A(\cdot)$ coincide, motivate asymptotic analysis of their
difference, which is the goal of this section.

In our problem the reward is unbounded and grows linearly. Hence, an
analogue to the average reward criterion in the bounded reward case would
involve $E(M(n))/n^2$. We note that by Corollary 3.3 and the dominated
convergence theorem we have the equality of $\lim_{n\to\infty} n^{-2}E(M_G(n))$ and
$\lim_{n\to\infty} n^{-2}E(M_A(n))$. This leads us to consider sensitive discount optimality
criteria to distinguish the performance of the alternating from that of
the greedy. Let us denote by $\nu^A_\lambda$ (resp., $\nu^G_\lambda$), for $0 < \lambda < 1$, the expected
total $\lambda$-discounted incremental matches under the alternating policy (resp.,
greedy policy). In the case of the alternating policy an easy calculation yields
$\nu^A_\lambda = (1-\lambda)^{-2}\lambda(1+\lambda)^{-1}\mu$. A study of the $-1$-discount optimality of the
alternating leads us to $\liminf_{\lambda\uparrow 1}(1-\lambda)(\nu^G_\lambda - \nu^A_\lambda)$. It follows by a Tauberian
theorem of Hardy and Littlewood (see, e.g., Theorem 7.4 of [9]) that

\[
\lim_{\lambda\uparrow 1}(1-\lambda)(\nu^G_\lambda - \nu^A_\lambda) = \lim_{n\to\infty} \frac{E(M_G(n) - M_A(n))}{n}
\]

when either limit exists. The first theorem shows that the limit on the right
exists and is positive, hence showing that the alternating policy is not
$-1$-discount optimal.
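The "easy calculation" for the alternating policy can be checked numerically. The sketch below assumes (this indexing is an assumption, not stated in the text) that the $n$th pick of the alternating policy has expected incremental gain $\mu\lfloor n/2\rfloor$ and is discounted by $\lambda^{n-1}$; under that convention the truncated series agrees with the closed form $(1-\lambda)^{-2}\lambda(1+\lambda)^{-1}\mu$. The function names are illustrative only.

```python
from math import isclose


def discounted_alternating(lam: float, mu: float, n_terms: int = 20000) -> float:
    """Truncated sum over n = 1..n_terms of lam**(n-1) * mu * floor(n/2).

    Assumption: the n-th alternating pick gains mu * floor(n/2) matches
    in expectation and is discounted by lam**(n-1).
    """
    return sum(lam ** (n - 1) * mu * (n // 2) for n in range(1, n_terms + 1))


def closed_form(lam: float, mu: float) -> float:
    """Closed form (1 - lam)^(-2) * lam * (1 + lam)^(-1) * mu."""
    return lam * mu / ((1.0 - lam) ** 2 * (1.0 + lam))


lam, mu = 0.9, 0.3  # illustrative values only
approx = discounted_alternating(lam, mu)
exact = closed_form(lam, mu)
print(approx, exact)
```

Multiplying the closed form by $(1-\lambda)^2$ and letting $\lambda \uparrow 1$ gives $\mu/2$, which reflects the quadratic growth $E(M_A(n)) \approx \mu n^2/4$.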

Theorem 4.1. For two chosen policies, one greedy and the other alternating,
we have

\[
\lim_{n\to\infty} \frac{E(M_G(n) - M_A(n))}{n} = \frac{\sigma_R^2 + \sigma_S^2}{8\mu}.
\]

Remark 4.1. It is easily seen that $E(\max(\Gamma_R[R_G(n)], \Gamma_S[S_G(n)]))$ represents
the expected incremental gain of matches from the $(n+1)$st pick by
a greedy policy. The proof of Theorem 3.1 then gives us

\[
\begin{aligned}
E(\max(\Gamma_R[R_G(n)], \Gamma_S[S_G(n)]))
&= \tfrac{1}{2} E(|\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]|) + \tfrac{1}{2} E(\Gamma_R[R_G(n)] + \Gamma_S[S_G(n)]) \\
&= \tfrac{1}{2} E(|\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]|) + \frac{n\mu}{2}.
\end{aligned}
\]

Interestingly, the greedy criterion implies that the process
$\{\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]\}_{n\ge 1}$ is a bounded Markov chain on a subset of $[-\gamma, \gamma]$, leaving
aside the versions of greedy which introduce unnecessary path dependence
on the epochs where the greedy criterion is ambivalent. This immediately
leads to the relation

\[
E(M_G(n)) = \tfrac{1}{2} \sum_{k=1}^{n-1} E(|\Gamma_R[R_G(k)] - \Gamma_S[S_G(k)]|) + \frac{n(n-1)\mu}{4}.
\tag{4.1}
\]
Remark 4.2. In the case of ergodicity of $\{\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]\}_{n\ge 1}$,
the above theorem yields

\[
E(\Delta M_G(n) - \Delta M_A(n)) \longrightarrow
\begin{cases}
\dfrac{\sigma_R^2 + \sigma_S^2}{8\mu} - \dfrac{\mu}{4}, & \text{along odd } n\text{'s},\\[2ex]
\dfrac{\sigma_R^2 + \sigma_S^2}{8\mu} + \dfrac{\mu}{4}, & \text{along even } n\text{'s},
\end{cases}
\]

where $\Delta$ is the difference operator. It is easily checked that there are examples
where $E(\Delta M_G(n) - \Delta M_A(n))$ is positive for all large $n$ and examples where it
oscillates in sign.

Remark 4.3. In the case of geometric ergodicity of
$\{\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]\}_{n\ge 1}$, the above theorem together with (4.1) leads to an expansion
of the form

\[
E(M_G(n)) = \frac{\mu}{4}\, n^2 + n\, \frac{\sigma_R^2 + \sigma_S^2}{8\mu} + \text{constant} + \varepsilon_n,
\]

where $\varepsilon_n$ tends to zero exponentially fast. This relation for the illustrative
example is given in (1.1), and Figure 1 graphically describes the Markov
chain $\{\Gamma_R[R_G(n)] - \Gamma_S[S_G(n)]\}_{n\ge 1}$.



Remark 4.4. The above, in particular, implies that, for large $n$,

\[
E(M_A(n)) \le E(M_G(n)) \le E(M_A(n + k)), \quad \text{where } k := \frac{\sigma_R^2 + \sigma_S^2}{4\mu^2}.
\]

An easy but informative upper bound for the $k$ of the above equation is
given by $\lceil (1-\mu)/(2\mu) \rceil$. It can be easily shown that, in general, $k$ cannot be
bounded away from infinity.

Fig. 1. Embedded Markov chain of the illustrative example.

The first theorem, while able to distinguish between the greedy and the
alternating policies, also shows that the difference is rather slim. It would
be interesting to know whether these two policies, if implemented on the
same sampled sequence of labels, will yield (in some sense) more matches
under the greedy than under the alternating. Such optimality is referred to in the
literature as sample path optimality. Results on sample path optimality for
the case of bounded rewards and finite/countable state and action spaces
under the assumption of uniform ergodicity of the state process can be found
in [7] and the references therein. The second theorem shows that the weak
limit of $M_G(n) - M_A(n)$ under this coupling is a scale mixture of normals
centered at zero. The final theorem of this section throws some light on its
sample path behavior under the same coupling. A more precise study of the
sample path behavior of $M_G(n) - M_A(n)$ is beyond the scope of this paper.
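The coupling described above, in which both policies read from the same sampled sequences of labels, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' construction: it assumes a finite label set, takes $X_R(j) = s_{L_R(j)}$ and $X_S(j) = r_{L_S(j)}$ so that $\Gamma_S[S(n)]$ is the expected gain from reading the next R record, lets the alternating policy read $\lceil n/2\rceil$ records from R and $\lfloor n/2\rfloor$ from S, and breaks greedy ties in favor of R. The name `simulate` is hypothetical.

```python
import random


def simulate(r, s, n, seed=0):
    """Run greedy and alternating on the SAME label streams.

    r, s : label probabilities of the R and S sources (finite support assumed).
    Returns (M_G, M_A, R_G), the match counts under each policy after n reads
    and the number of R records read by greedy.
    """
    rng = random.Random(seed)
    labels = range(len(r))
    L_R = rng.choices(labels, weights=r, k=n)  # labels of R records, in reading order
    L_S = rng.choices(labels, weights=s, k=n)  # labels of S records, in reading order

    def matches(k_R, k_S):
        """Number of matching (R, S) pairs among the first k_R and k_S records."""
        cnt_R = [0] * len(r)
        cnt_S = [0] * len(r)
        for x in L_R[:k_R]:
            cnt_R[x] += 1
        for y in L_S[:k_S]:
            cnt_S[y] += 1
        return sum(a * b for a, b in zip(cnt_R, cnt_S))

    # Alternating: ceil(n/2) records from R and floor(n/2) from S (assumption).
    m_A = matches((n + 1) // 2, n // 2)

    # Greedy: read from the source with the larger expected gain on the next step.
    k_R = k_S = 0
    gamma_R = gamma_S = 0.0  # Gamma_R[k_R] and Gamma_S[k_S]
    for _ in range(n):
        if gamma_S >= gamma_R:          # expected gain from R is Gamma_S; ties go to R
            gamma_R += s[L_R[k_R]]
            k_R += 1
        else:                           # expected gain from S is Gamma_R
            gamma_S += r[L_S[k_S]]
            k_S += 1
    return matches(k_R, k_S), m_A, k_R


m_G, m_A, r_G = simulate([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], n=201, seed=1)
print(m_G - m_A, r_G)
```

Repeating the run over many seeds and plotting $(M_G(n) - M_A(n))/n^{5/4}$ gives an empirical view of the fluctuations that the theorems of this section describe.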

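Theorem 4.2. For two chosen policies, one greedy and the other alternating,
we have

\[
\frac{M_G(n) - M_A(n)}{n^{5/4}} \xrightarrow{d} F \quad \text{as } n \to \infty,
\]

where $F$ is a scale mixture of normals centered at zero given by

\[
F := \int N(0, \sigma^2)\, dG(\sigma^2), \quad \text{where } G := \left| N\!\left(0,\; \frac{(\sigma_R^2 + \sigma_S^2)^3}{128\mu^2}\right) \right|.
\]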



Theorem 4.3. For two chosen policies, one greedy and the other alternating,
we have

\[
\liminf_{n\to\infty} \frac{M_G(n) - M_A(n)}{n^{5/4}(\log_2 n)^{1/4}} = -\infty \ \text{ a.s.},
\qquad
\limsup_{n\to\infty} \frac{M_G(n) - M_A(n)}{n^{5/4}(\log_2 n)^{1/4}} = \infty \ \text{ a.s.}
\tag{4.2}
\]

and

\[
\frac{M_G(n) - M_A(n)}{n^{5/4}(\log_2 n)^{1/4}} = O\bigl((\log n)^{1/2}\bigr) \ \text{ a.s.}
\tag{4.3}
\]



In the following we will need the filtration $\{\mathcal{G}_n\}_{n\ge 0}$ defined as

\[
\mathcal{G}_n = \mathcal{G}_0 \vee \sigma\langle L_R(1), \ldots, L_R(R_G(n));\; L_S(1), \ldots, L_S(S_G(n)) \rangle, \quad n \ge 1,
\]

with $\mathcal{G}_0$ containing all the information needed for randomization by not
only $C_G$ but also $C_A$. The argument for the above results depends on the
sequence of random times $\{T_n\}_{n\ge 1}$, where $T_n$ is essentially the epoch at
which the greedy decides to pick the first record (from R or S) which would
not be seen following the alternating policy by the $n$th epoch. Formally, they
are defined as

\[
T_n := \min\Bigl\{ n,\; \inf\bigl\{ k \ge 1 : S_G(k+1) = \lfloor n/2 \rfloor + 1 \ \text{ or } \ R_G(k+1) = \lceil n/2 \rceil + 1 \bigr\} \Bigr\}
\tag{4.4}
\]

for $n \ge 1$. It is easily checked that $\{T_n\}_{n\ge 1}$ is a sequence of $\{\mathcal{G}_n\}_{n\ge 0}$ stopping
times. Also, for convenience, we define the sequence of events $\{A_n\}_{n\ge 1}$ by
$A_n := \{R_G(T_n) = \lceil n/2 \rceil\}$ for $n \ge 1$.

4.1. Study of stopping times $\{T_n\}_{n\ge 1}$. We note that, similarly to Lemma
3.1, it can be shown that, for positive $x$,

\[
(n - T_n) > x \;\Longrightarrow\;
\Gamma_R[\lceil n/2 \rceil] \le \Gamma_S[\lfloor n/2 \rfloor - \lfloor x \rfloor] + \gamma
\ \text{ or } \
\Gamma_S[\lceil n/2 \rceil] \le \Gamma_R[\lfloor n/2 \rfloor - \lfloor x \rfloor] + \gamma
\tag{4.5}
\]

and

\[
(n - T_n) > x \;\Longleftarrow\;
\Gamma_R[\lceil n/2 \rceil] < \Gamma_S[\lfloor n/2 \rfloor - \lceil x \rceil]
\ \text{ or } \
\Gamma_S[\lceil n/2 \rceil] < \Gamma_R[\lfloor n/2 \rfloor - \lceil x \rceil].
\tag{4.6}
\]

This leads to the first lemma, which describes both the weak limit and sample
path behavior of $T_n$. The second lemma is the weak law of large numbers for
the post-$T_n$ (and pre-$n$) selection ratio. The third lemma derives exponential
probability inequalities for both $T_n$ and $R_G(n)$ which are useful in establishing
the required uniform integrability results and the uniform central limit
theorem of Lemma 4.6.



Lemma 4.1. For the above defined stopping times $\{T_n\}_{n\ge 1}$, the following
hold:

(i)
\[
\frac{n - T_n}{\sqrt{n}} \xrightarrow{d} \bigl| N(0, 4\sigma^2_{R_G}) \bigr| \quad \text{as } n \to \infty,
\tag{4.7}
\]
where $\sigma^2_{R_G}$, the asymptotic variance of $R_G(n)$, is defined in (3.10).

(ii)
\[
\liminf_{n\to\infty}\, (n - T_n) = 0 \ \text{ a.s.} \quad \text{and} \quad
\limsup_{n\to\infty} \frac{n - T_n}{\sqrt{8\sigma^2_{R_G}\, n\log_2 n}} = 1 \ \text{ a.s.}
\tag{4.8}
\]

Proof. A proof of (4.7) and of the second part of (4.8) follows along
similar lines as Theorem 3.2 and Theorem 3.3, respectively, the key difference
being that (4.5) and (4.6) are used instead of Lemma 3.1. The
first part of (4.8) follows as Theorem 3.3 implies that the event
$\{R_G(n) = S_G(n) \text{ for infinitely many } n\}$ occurs with probability one. The details are
skipped to avoid repetition of similar arguments. □

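Lemma 4.2. For the above defined stopping times $\{T_n\}_{n\ge 1}$ corresponding
to the greedy reading policy $C_G$, we have

\[
\frac{R_G(n) - R_G(T_n)}{n - T_n} \xrightarrow{p} \frac{1}{2} \quad \text{as } n \to \infty.
\tag{4.9}
\]

Proof. First, we show that

\[
\frac{R_G(n) - R_G(T_n)}{\log(n)} \xrightarrow{p} \infty \quad \text{as } n \to \infty.
\tag{4.10}
\]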

By a double application of (2.5), we have

\[
\bigl| (\Gamma_R[R_G(n)] - \Gamma_R[R_G(T_n)]) - (\Gamma_S[S_G(n)] - \Gamma_S[S_G(T_n)]) \bigr| \le 2\gamma,
\tag{4.11}
\]

which implies that, for any positive $K$,

\[
R_G(n) - R_G(T_n) < K\log(n) \;\Longrightarrow\;
\sum_{i=1}^{n - T_n - K\log(n)} X_S(i + S_G(T_n)) - 2\gamma \le \sum_{i=1}^{K\log(n)} X_R(i + R_G(T_n)).
\tag{4.12}
\]

The second expression can be rewritten as 

r- T "^**w Xs (i + s G {T n ))-A _ ( K ^ n) x R ( i + R G (T n ))-A 

{ ^ y/n-T n -K\og(n)J { ^ jK\og(n) ) 



< 2 7 - fjL(y/n -T n - mog(n) - y/K\og(n)). 

As $n^{-1/4}(n - T_n) \xrightarrow{p} \infty$ (Lemma 4.1), we have the independent terms on the
left converging to normal distributions and the term on the right converging
to negative infinity. Hence, the probability of the above event converges to
zero. Now using (4.12), we obtain (4.10). Combining (4.10) with Theorem
3.3, we have

\[
R_G(n) \xrightarrow{p} \infty \quad \text{and} \quad \frac{R_G(n) - R_G(T_n)}{\log(R_G(n))} \xrightarrow{p} \infty \quad \text{as } n \to \infty.
\tag{4.13}
\]

Statement (4.13) together with Theorem 5.1 of [6] on lag sums gives us

\[
\frac{\Gamma_R[R_G(n)] - \Gamma_R[R_G(T_n)]}{R_G(n) - R_G(T_n)} \xrightarrow{p} \mu
\quad \text{and} \quad
\frac{\Gamma_S[S_G(n)] - \Gamma_S[S_G(T_n)]}{S_G(n) - S_G(T_n)} \xrightarrow{p} \mu,
\]

where the second part follows by symmetry. This with (4.11) and (4.10)
gives

\[
\frac{R_G(n) - R_G(T_n)}{S_G(n) - S_G(T_n)} \xrightarrow{p} 1,
\]

which is equivalent to (4.9). □

Remark 4.5. The above, in particular, implies that

\[
\frac{R_G(n) - R_G(T_n)}{\sqrt{n}}
= \left( \frac{R_G(n) - R_G(T_n)}{n - T_n} \right) \left( \frac{n - T_n}{\sqrt{n}} \right)
\xrightarrow{d} \bigl| N(0, \sigma^2_{R_G}) \bigr| \quad \text{as } n \to \infty,
\]

where the convergence in probability of the first term follows from (4.9) and
the weak convergence of the second term to the folded normal follows from
(4.7). In fact, the stronger result that

\[
\frac{R_G(n) - R_G(T_n) - (n - T_n)/2}{\sqrt{n - T_n}}
\]

converges weakly can be shown using an argument similar to that used to prove Theorem 3.2.

Lemma 4.3. For a greedy policy, we have the following: 
(i) For t > 1 and n > (2 + ^) 2 , 



(4.14) Pr 



R G (n)-n/2 x \ ^ n f ^ x 2 



1 *y ] 



n>l 



» n 

(ii) The sequence 

^ {(^ 

is uniformly integrable. 

Proof. Toward proving (4.14), let $k = \lfloor n/2 + t\sqrt{n} \rfloor$. By Lemma 3.1, we
have

\[
\Pr(R_G(n) > n/2 + t\sqrt{n}) \le \Pr(\Gamma_R[k] \le \Gamma_S[n - k] + \gamma).
\]

Observe that

\[
\Gamma_S[n - k] - \Gamma_R[k] + (2k - n)\mu = \bigl[\Gamma_S[n - k] - \Gamma_R[n - k]\bigr] - \bigl[\Gamma_R[k] - \Gamma_R[n - k] - (2k - n)\mu\bigr],
\]

which implies that the term on the left-hand side is a sum of $k$ independent
zero mean random variables taking values in $[-\gamma, \gamma]$. This along with
Hoeffding's inequality (see [15]) implies

PrraM < r.[„ - *] +7 < expj z&Z^]! }. 

Working with the upper bound above and using the inherent symmetry, we
get the simple upper bound in (4.14).

For the sequence in (4.15), we get an inequality similar to (4.14) by imitating
the above argument, the only change being that (4.5) and (4.6) are
used instead of Lemma 3.1. Since this bound is integrable and free of $n$, we
have the uniform integrability of the sequence in (4.15). □
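The role of Hoeffding's inequality in the proof above can be illustrated on a toy case. The sketch below is not the bound (4.14) itself; it assumes the simplest sum covered by the inequality, namely $k$ independent variables taking the values $\pm g$ with probability $1/2$ each (zero mean, values in $[-g, g]$), computes the exact upper tail from the binomial distribution, and compares it with the one-sided Hoeffding bound $\exp\{-t^2/(2kg^2)\}$. The function names and the parameter `g` are illustrative only.

```python
from math import comb, exp


def exact_tail(k: int, g: float, t: float) -> float:
    """Exact P(S >= t) for S = sum of k independent +/-g coin flips.

    With j flips equal to +g, the sum is (2*j - k) * g.
    """
    favourable = sum(comb(k, j) for j in range(k + 1) if (2 * j - k) * g >= t)
    return favourable / 2 ** k


def hoeffding_bound(k: int, g: float, t: float) -> float:
    """One-sided Hoeffding bound exp(-2 t^2 / (k (2g)^2)) = exp(-t^2 / (2 k g^2))."""
    return exp(-t * t / (2.0 * k * g * g))


for t in (1.0, 2.0, 5.0, 10.0):
    print(t, exact_tail(50, 0.5, t), hoeffding_bound(50, 0.5, t))
```

Because the bound decays like $\exp(-ct^2)$ uniformly in the number of terms once $t$ is measured on the $\sqrt{k}$ scale, it is integrable in $t$, which is the property the uniform integrability argument relies on.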

4.2. Heuristics for the theorems. We find it convenient to divide the
records sampled by the $n$th epoch by either the alternating or the greedy
into three sets. The first set consists of the first $T_n$ records sampled by
the greedy. The second consists of the last $n - T_n$ records sampled by the
greedy. The third consists of the records sampled by the alternating and
not contained in the first set. Observe that all of the records in the first set
(except possibly one) are sampled by the alternating by the $n$th epoch. Also
note that all of the records in the third set belong to a single source and its
cardinality is within one of the cardinality of the second set. The upshot of
this is that $M_G(n) - M_A(n)$ is essentially the number of matches between
records of the first and the second set and the records of the second set with
themselves minus the number of matches between records of the first and
the third set.

First, we argue that in the expected difference $E(M_G(n) - M_A(n))$ the
significant term comes from the matches generated among the records of the
second set, which by Lemmas 4.1 and 4.3 will be of order $n$. This is so as
the expected number of matches between records of the first and the second
set minus the expected number of matches between the first and third
set is at most of order $\sqrt{n}$; this follows by observing that
$|\Gamma_R[R_G(T_n)] - \Gamma_S[S_G(T_n)]| \le \gamma$ and $n - T_n = O_p(\sqrt{n})$ (by Lemma 4.1). Second, by Lemma 4.2,
roughly $(n - T_n)/2$ of the records in the second set will be from each source
and, hence, using the law of large numbers, the expected number of matches
generated by these among themselves will be approximately $E((n - T_n)^2)\mu/4$
or, by Lemma 4.1, approximately $n\mu\sigma^2_{R_G}$. This completes the heuristic for
Theorem 4.1. Lemma 4.4 formalizes the latter part of the argument and the
proof of Theorem 4.1 does the rest.

In contrast to the above, in the study of the sample path behavior and
weak limit of $M_G(n) - M_A(n)$ we find that the insignificant terms of the
above become significant and vice versa. First, the number of matches generated
by the records of the second set with themselves is comparable to
$M_G(n - T_n)$, which is $O_p(n)$, using Corollary 3.1 and Lemma 4.1. On the
other hand, the number of matches between records of the first and the
second set minus the number of matches between the first and third set is
$O_p(n^{5/4})$. Toward an argument, suppose without loss of generality that the
greedy picks more R records than the alternating. Now the above difference
is easily checked to be the difference in the numbers of matches generated
by the excess R records sampled by the greedy and the excess S records
sampled by the alternating, both with records from the first set. Moreover, when
divided by $n$, by the law of large numbers, the distribution of labels on the
records from both sources in the first set approaches their respective vectors
($r$ or $s$). This then makes the difference resemble $nG_n$, where the sequence
of random variables $\{G_n\}_{n\ge 1}$ is defined as

\[
G_n :=
\begin{cases}
(\Gamma_R[R_G(n)] - \Gamma_R[\lceil n/2 \rceil]) - (\Gamma_S[\lfloor n/2 \rfloor] - \Gamma_S[S_G(n)]), & \text{on } A_n,\\
(\Gamma_S[S_G(n)] - \Gamma_S[\lfloor n/2 \rfloor]) - (\Gamma_R[\lceil n/2 \rceil] - \Gamma_R[R_G(n)]), & \text{on } A_n^c.
\end{cases}
\]

Finally, $G_n$, being the $(n - T_n)$th term of a bounded increment martingale,
is of order $\sqrt{n - T_n}$ (that is, $O_p(n^{1/4})$) and, more importantly, normalized
by $\sqrt{n - T_n}$ should converge to a normal limit. This is essentially the argument
for Theorem 4.2, while Theorem 4.3 follows as an application of
the above with a Borel-Cantelli type argument. Lemma 4.5 proves that
$n^{-5/4}(M_G(n) - M_A(n)) \sim n^{-1/4}G_n$, Lemma 4.6 provides a uniform central
limit theorem for the martingales behind $G_n$, and the proofs of the theorems
complete the rest of the arguments.

4.3. Proofs of the theorems. 

Lemma 4.4. For a greedy policy and the sequence of stopping times
$\{T_n\}_{n\ge 1}$ defined in (4.4), we have

\[
\left(\frac{1}{n}\right) \bigl[N_R(R_G(n)) - N_R(R_G(T_n))\bigr] \cdot \bigl[N_S(S_G(n)) - N_S(S_G(T_n))\bigr]
\xrightarrow{d} \mu\sigma^2_{R_G}\chi^2_1.
\tag{4.16}
\]

Moreover, we also have $L^1$ convergence.



Proof. First, we will show that

\[
\left\langle \frac{N_R(R_G(n)) - N_R(R_G(T_n))}{R_G(n) - R_G(T_n)},\; \frac{N_S(S_G(n)) - N_S(S_G(T_n))}{S_G(n) - S_G(T_n)} \right\rangle \xrightarrow{p} \mu.
\tag{4.17}
\]

Note that by inherent symmetry, Slutsky's theorem and the fact that both
terms above are probability vectors, it suffices to show that

\[
\frac{N_R[R_G(n), j] - N_R[R_G(T_n), j]}{R_G(n) - R_G(T_n)} \xrightarrow{p} r_j, \quad j \ge 1.
\tag{4.18}
\]

Now (4.13) combined with Theorem 5.1 of [6] on lag sums gives us (4.18)
and, hence, (4.17).

Second, by Lemma 4.2 and Slutsky's theorem,

\[
\left( \frac{R_G(n) - R_G(T_n)}{n - T_n} \right) \left( \frac{S_G(n) - S_G(T_n)}{n - T_n} \right) \xrightarrow{p} \frac{1}{4}.
\tag{4.19}
\]

Combining (4.17) and (4.19) with Lemma 4.1, and using Slutsky's theorem,
we have (4.16). Now observe that (4.17) and (4.19) are nonnegative
sequences bounded above by one. This together with the uniform integrability
of the sequence $\{n^{-1}(n - T_n)^2\}_{n\ge 1}$ provided by Lemma 4.3 gives us
the $L^1$ convergence. Hence, the proof. □



Proof of Theorem 4.1. Working on the set $\{R_G(T_n) = \lceil n/2 \rceil\}$, we
have

\[
\begin{aligned}
M_G(n) - M_A(n)
&= (N_R(R_G(n)) - N_R(\lceil n/2 \rceil)) \cdot N_S(S_G(T_n)) \\
&\quad + (N_R(R_G(n)) - N_R(\lceil n/2 \rceil)) \cdot (N_S(S_G(n)) - N_S(S_G(T_n))) \\
&\quad - N_R(\lceil n/2 \rceil) \cdot (N_S(\lfloor n/2 \rfloor) - N_S(S_G(n))).
\end{aligned}
\tag{4.20}
\]

The first term on the right-hand side can be written as

\[
\bigl(N_R(R_G(n)) - N_R(\lceil n/2 \rceil) - [R_G(n) - \lceil n/2 \rceil]\, r\bigr) \cdot N_S(S_G(T_n)) + [R_G(n) - \lceil n/2 \rceil]\, \Gamma_S[S_G(T_n)].
\tag{4.21}
\]

The first expression in (4.21) has zero conditional expectation given $\mathcal{G}_{T_n}$ on
the set $A_n$, as it is the $(n - T_n)$th term of a zero-mean martingale. The argument
for this assertion is similar to that found in the proof of Theorem 3.1. The
third term on the right-hand side of (4.20) can be written as

\[
N_R(\lceil n/2 \rceil) \cdot \bigl(N_S(\lfloor n/2 \rfloor) - N_S(S_G(n)) - [R_G(n) - \lceil n/2 \rceil]\, s\bigr) + [R_G(n) - \lceil n/2 \rceil]\, \Gamma_R[\lceil n/2 \rceil].
\tag{4.22}
\]

The first expression in (4.22) has a conditional expectation of zero on the
set $A_n$ as it is independent of $\mathcal{G}_n$ ($\supseteq \mathcal{G}_{T_n}$) and, conditioned on $\mathcal{G}_n$, has zero
mean. Using symmetry together with (4.21) and (4.22), we have

\[
\begin{aligned}
&\frac{1}{n}\,\bigl| E(M_G(n) - M_A(n)) - E\bigl([N_R(R_G(n)) - N_R(R_G(T_n))] \cdot [N_S(S_G(n)) - N_S(S_G(T_n))]\bigr) \bigr| \\
&\qquad \le \frac{1}{n}\, E\bigl( |\Gamma_R[R_G(T_n)] - \Gamma_S[S_G(T_n)]|\; |R_G(n) - \lceil n/2 \rceil| \bigr) \\
&\qquad \le \gamma\, E\left( \frac{|R_G(n) - \lceil n/2 \rceil|}{n} \right) \longrightarrow 0,
\end{aligned}
\]

where the convergence to zero of the last term follows by Theorem 3.2 and
the dominated convergence theorem. The theorem follows now by using
Lemma 4.4. □



Lemma 4.5.

\[
\frac{M_G(n) - M_A(n)}{n^{5/4}} - \frac{G_n}{2n^{1/4}} \xrightarrow{\text{a.s.}} 0 \quad \text{as } n \to \infty.
\]



Proof. We start with a decomposition analogous to (4.20),

\[
\begin{aligned}
\frac{M_G(n) - M_A(n)}{n} - \frac{G_n}{2}
&= \frac{1}{n}\, (N_R(R_G(n)) - N_R(R_G(T_n))) \cdot (N_S(S_G(n)) - N_S(S_G(T_n))) \\
&\quad + 1_{A_n} \Bigl[ (N_R(R_G(n)) - N_R(R_G(T_n))) \cdot \Bigl( \frac{N_S(S_G(T_n))}{n} - \frac{s}{2} \Bigr) \\
&\qquad\qquad - (N_S(\lfloor n/2 \rfloor) - N_S(S_G(n))) \cdot \Bigl( \frac{N_R(R_G(T_n))}{n} - \frac{r}{2} \Bigr) \Bigr] \\
&\quad + 1_{A_n^c} \Bigl[ (N_S(S_G(n)) - N_S(S_G(T_n))) \cdot \Bigl( \frac{N_R(R_G(T_n))}{n} - \frac{r}{2} \Bigr) \\
&\qquad\qquad - (N_R(\lceil n/2 \rceil) - N_R(R_G(n))) \cdot \Bigl( \frac{N_S(S_G(T_n))}{n} - \frac{s}{2} \Bigr) \Bigr].
\end{aligned}
\tag{4.23}
\]

We now show that each term on the right-hand side of (4.23), upon division
by $n^{1/4}$, converges almost surely to zero. For the first term, using

\[
(N_R(R_G(n)) - N_R(R_G(T_n))) \cdot (N_S(S_G(n)) - N_S(S_G(T_n))) \le (n - T_n)^2,
\]

the result follows from Lemma 4.1. The second and third terms on the right-hand
side of (4.23) are similar (by symmetry) and, hence, it suffices to deal
solely with the second. We observe that similar arguments exist to show that
each of the two expressions forming the second term, when divided by $n^{1/4}$,
converges almost surely to zero. Hence, we give only the argument for the
first. Since $R_G(T_n) + S_G(T_n) = T_n$, we observe that on $A_n$



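\[
\frac{|n/2 - S_G(T_n)|}{n^{5/8}}
= \underbrace{\frac{|n/2 - S_G(T_n)|}{\max(n - T_n, 1)}}_{\text{bounded by } 1}\;
\underbrace{\frac{\max(n - T_n, 1)}{n^{5/8}}}_{\to 0 \text{ (Lemma 4.1)}}
\xrightarrow{\text{a.s.}} 0.
\tag{4.24}
\]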



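Second, we define two sequences of random variables converging almost
surely to zero. Let $\{U_n\}_{n\ge 1}$ be defined as

\[
U_n := \underbrace{\frac{R_G(n) - R_G(T_n)}{n - T_n}\, \sqrt{\frac{S_G(T_n)}{n}}}_{\text{bounded by } 1}\;
\underbrace{\frac{n - T_n}{n^{5/8}}\, \sqrt{\frac{\log_2 S_G(T_n)}{n^{1/4}}}}_{\to 0 \text{ (Lemma 4.1)}}, \quad \forall n \ge 1,
\tag{4.25}
\]

and $\{W_n\}_{n\ge 1}$ as

\[
W_n := \underbrace{\frac{R_G(n) - R_G(T_n)}{n - T_n}}_{\text{bounded by } 1}\;
\underbrace{\frac{n - T_n}{n^{5/8}}}_{\to 0 \text{ (Lemma 4.1)}}\;
\underbrace{\frac{n/2 - S_G(T_n)}{n^{5/8}}}_{\to 0 \text{ by (4.24)}}, \quad \forall n \ge 1.
\tag{4.26}
\]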

Third, using the above, we decompose the expression of interest as

\[
\begin{aligned}
&\frac{1}{n^{1/4}}\, (N_R(R_G(n)) - N_R(R_G(T_n))) \cdot \Bigl( \frac{N_S(S_G(T_n))}{n} - \frac{s}{2} \Bigr) \\
&\qquad = U_n \left\langle \frac{N_R(R_G(n)) - N_R(R_G(T_n))}{R_G(n) - R_G(T_n)},\; \sqrt{\frac{S_G(T_n)}{\log_2 S_G(T_n)}} \Bigl( \frac{N_S(S_G(T_n))}{S_G(T_n)} - s \Bigr) \right\rangle \\
&\qquad\quad - W_n\, \underbrace{\left\langle \frac{N_R(R_G(n)) - N_R(R_G(T_n))}{R_G(n) - R_G(T_n)},\; s \right\rangle}_{\text{bounded by } 1}.
\end{aligned}
\]

In view of (4.25) and (4.26), to show that the above converges almost surely
to zero, it suffices to show that, with probability one,

\[
\limsup_{n\to\infty} \left| \left\langle \frac{N_R(R_G(n)) - N_R(R_G(T_n))}{R_G(n) - R_G(T_n)},\; \sqrt{\frac{S_G(T_n)}{\log_2 S_G(T_n)}} \Bigl( \frac{N_S(S_G(T_n))}{S_G(T_n)} - s \Bigr) \right\rangle \right| < \infty.
\]

But this follows from the Cauchy-Schwarz inequality and Lemma A.1 since
the first term is a probability vector. Hence, the proof. □

The proofs of Theorems 4.2 and 4.3 will require a uniform central limit
theorem for a class of policies which can be described as greedy with offsets.
This is the content of the next lemma; below we describe some needed
notation. Let $G_\delta$, for $\delta \in [-\gamma, \gamma]$, be a policy satisfying

\[
C_{G_\delta}(n+1) =
\begin{cases}
1, & \text{if } \Gamma_S[S(n)] > \Gamma_R[R(n)] + \delta,\\
0, & \text{if } \Gamma_S[S(n)] < \Gamma_R[R(n)] + \delta,
\end{cases}
\qquad n = 1, 2, \ldots.
\]
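The offset rule can be sketched directly. The code below is an illustration under assumptions not fixed by the text: a finite label set, increments $X_R(j) = s_{L_R(j)}$ and $X_S(j) = r_{L_S(j)}$, the value $1$ of $C_{G_\delta}$ meaning "read from R", and ties resolved in favor of S (the displayed rule leaves ties unspecified). It records the gap $\Gamma_S - \Gamma_R - \delta$ after every step; an elementary induction on the rule shows that this gap stays in $[\min(-\delta, -\max_i s_i),\ \max(-\delta, \max_i r_i)]$, which is the kind of boundedness the martingale arguments exploit. The name `run_offset_greedy` is hypothetical.

```python
import random


def run_offset_greedy(r, s, delta, n, seed=0):
    """Simulate n reads of a greedy-with-offset policy G_delta.

    Reads from R when Gamma_S > Gamma_R + delta, otherwise from S
    (ties go to S; this tie rule is an assumption).
    Returns the number of R reads and the list of gaps
    Gamma_S - Gamma_R - delta observed after each read.
    """
    rng = random.Random(seed)
    labels = range(len(r))
    gamma_R = gamma_S = 0.0
    reads_from_R = 0
    gaps = []
    for _ in range(n):
        if gamma_S > gamma_R + delta:
            label = rng.choices(labels, weights=r)[0]  # next R record
            gamma_R += s[label]                        # X_R = s[label]
            reads_from_R += 1
        else:
            label = rng.choices(labels, weights=s)[0]  # next S record
            gamma_S += r[label]                        # X_S = r[label]
        gaps.append(gamma_S - gamma_R - delta)
    return reads_from_R, gaps


r, s = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]  # illustrative distributions
k_R, gaps = run_offset_greedy(r, s, delta=0.1, n=2000, seed=1)
print(k_R, min(gaps), max(gaps))
```

With $\delta = 0$ this reduces to a plain greedy rule, and the number of reads from R stays close to $n/2$.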



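Let $\{X^*_R(n)\}_{n\ge 1}$ and $\{X^*_S(n)\}_{n\ge 1}$ denote two auxiliary sequences of i.i.d.
random variables with $X^*_R \stackrel{d}{=} X_S$ and $X^*_S \stackrel{d}{=} X_R$, and let $\Gamma^*_R(\cdot)$ and $\Gamma^*_S(\cdot)$
denote their respective partial sums. For $\delta \in [-\gamma, \gamma]$, we define the sequences
of random variables $\{Y^\delta_n\}_{n\ge 1}$ and $\{Z^\delta_n\}_{n\ge 1}$ as

\[
Y^\delta_n := \frac{\Gamma_R[R_{G_\delta}(n)] - \Gamma^*_R(R_{G_\delta}(n))}{\sqrt{n(\sigma_R^2 + \sigma_S^2)/2}}, \quad n \ge 1,
\]

and

\[
Z^\delta_n := \frac{\Gamma_S[S_{G_\delta}(n)] - \Gamma^*_S(S_{G_\delta}(n))}{\sqrt{n(\sigma_R^2 + \sigma_S^2)/2}}, \quad n \ge 1.
\]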

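Lemma 4.6. There exists a $K > 0$ such that

\[
\max_{\delta \in [-\gamma, \gamma]} \Bigl\{ \sup_{t \in \mathbb{R}} |\Pr(Y^\delta_n \le t) - \Phi(t)|,\; \sup_{t \in \mathbb{R}} |\Pr(Z^\delta_n \le t) - \Phi(t)| \Bigr\} \le K n^{-1/4}\log(n), \quad n > 1.
\tag{4.27}
\]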



Proof. It suffices, by symmetry, to show that the first of the two expressions
in (4.27) satisfies the bound. We use a filtration $\{\mathcal{H}_m\}_{m\ge 0}$ defined
for $m \ge 1$ as

\[
\mathcal{H}_m = \mathcal{H}_0 \vee \sigma\langle L_R(1), \ldots, L_R(R_{G_\delta}(m));\; L_S(1), \ldots, L_S(S_{G_\delta}(m));\; X^*_R(1), \ldots, X^*_R(R_{G_\delta}(m)) \rangle,
\]

with $\mathcal{H}_0$ containing all the information needed for randomization by $C_{G_\delta}$.
Also, we define, for a fixed n > 1, 



X R (i? Gf (m))-Xg(i? Gi (m)) 



if C7 G4 (m) = l, 



m = 2, 3, . . . , n, 



ifC Ga (m) = 0, 
with Z?i := -^L. By construction, 

n 

(4.28) 2- D * = y n and maxA<n" 1/2 



i<n 



al + al)/2 



As C G(5 (m) is TCm-i measurable and both Xn(R Gg (m)) and Xg(i?G! 5 (m)) 
are independent of 7i m -i, we have 

2" 



E(D m |W m _i) = and E(£>4|W m _i) 



(4.29) 



■;?. 



1 < m < n. 



Hence, as D & m is Tt m measurable, Ai}l<m<n is a martingale. As a 

consequence of (4.29), we have 

n ,n\ 

Vn ■= E E (A 2 |^_i) = - )R Gs (n). 



i=l 



n 



This implies that 

Pr(|V; 2 -l|>n- 1 / 2 (log(n)) 2 ) 



(4.30) 



Pr 



R Gs (n)-n/2 



n 



> ( 5 ) (log(n)) 2 



By an argument similar to that in the proof of Lemma 4.3, we get, analogous 
to (4.14), for t > 1 and n > 4(1 + ^) 2 , 



(4.31) 



Pr 



R Gs (n)-n/2 



I'- 

i 2 



n 



> t < 2 exp 



4"> 



Combining (4.30) and (4.31), we get 
(4.32) Pr(|y n 2 - 1| > n- 1 / 2 (log(n)) 2 ) < exp 



Vn> 1. 
Inequalities in (4.28) and (4.32) imply that the two conditions of Lemma
A.2 are satisfied in our case. Hence, we have (4.27), for some $K$ free of $\delta$.
Hence, the proof. □

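Proof of Theorem 4.2. In view of Lemma 4.5, it suffices to derive
the weak limit of $n^{-1/4}G_n$. We start by observing that, for all $u$, on $A_n$,

\[
\Pr\left( \frac{(\Gamma_R[R_G(n)] - \Gamma_R[\lceil n/2 \rceil]) - (\Gamma_S[\lfloor n/2 \rfloor] - \Gamma_S[S_G(n)])}{\sqrt{(\sigma_R^2 + \sigma_S^2)(n - T_n)/2}} \le u \;\Big|\; \mathcal{G}_{T_n} \right)
= \Pr\bigl(Y^{\Delta_n}_{n - T_n} \le u\bigr)
\]

and on $A_n^c$,

\[
\Pr\left( \frac{(\Gamma_S[S_G(n)] - \Gamma_S[\lfloor n/2 \rfloor]) - (\Gamma_R[\lceil n/2 \rceil] - \Gamma_R[R_G(n)])}{\sqrt{(\sigma_R^2 + \sigma_S^2)(n - T_n)/2}} \le u \;\Big|\; \mathcal{G}_{T_n} \right)
= \Pr\bigl(Z^{\Delta_n}_{n - T_n} \le u\bigr),
\]

where $\Delta_n := \Gamma_R[R_G(T_n)] - \Gamma_S[S_G(T_n)]$. This, along with Lemma 4.6, leads
to

\[
\begin{aligned}
&\left| \Pr\left( \frac{G_n/2}{\sqrt{0.125(\sigma_R^2 + \sigma_S^2)(n - T_n)}} \le u \right) - \Phi(u) \right| \\
&\qquad = \left| \int_{A_n} \Pr\bigl(Y^{\Delta_n}_{n - T_n} \le u\bigr)\, dP + \int_{A_n^c} \Pr\bigl(Z^{\Delta_n}_{n - T_n} \le u\bigr)\, dP - \Phi(u) \right| \\
&\qquad \le K\, E\bigl(\min[1,\; (n - T_n)^{-1/4}\log(n - T_n)]\bigr) \longrightarrow 0 \quad \text{as } n \to \infty.
\end{aligned}
\]

In other words, we have shown that

\[
\frac{G_n}{2\sqrt{n - T_n}} \xrightarrow{d} N\!\left(0,\; \frac{\sigma_R^2 + \sigma_S^2}{8}\right) \quad \text{as } n \to \infty.
\]

This with the asymptotic independence between the terms on the right of

\[
\frac{G_n}{2n^{1/4}} = \frac{G_n}{2\sqrt{n - T_n}}\, \sqrt{\frac{n - T_n}{\sqrt{n}}}
\]

and Lemma 4.1 completes the proof. □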

Proof of Theorem 4.3. In view of Lemma 4.5, to prove (4.2), it
suffices to show that

\[
\liminf_{n\to\infty} \frac{G_n}{(n\log_2(n))^{1/4}} = -\infty \quad \text{and} \quad \limsup_{n\to\infty} \frac{G_n}{(n\log_2(n))^{1/4}} = \infty.
\]

Due to the similarity of the arguments, we prove only the latter. Toward
this end, we define a sequence of stopping times $\{T^*_n\}_{n\ge 1}$ as

\[
T^*_n :=
\begin{cases}
\inf\bigl\{ T_{2k} > T^*_{n-1} \;\big|\; k \ge 2;\; R_G(T_{2k}) = k;\; 2k - T_{2k} > \sigma_{R_G}\sqrt{2k\log_2(2k)} \bigr\}, & n \text{ odd},\\
\inf\bigl\{ k > T^*_{n-1} \;\big|\; R_G(k) = S_G(k) \bigr\}, & n \text{ even}.
\end{cases}
\]

The stopping times are easily checked to be well defined using the definition
of $\{T_n\}_{n\ge 1}$ and Theorem 3.3. Let $C > 0$ be an arbitrary constant and $\{B_i\}_{i\ge 1}$
be a sequence of events defined by

l(24 GJ RG(T 2 V 1 )log 2 (2 J R G (T 2 V 1 )))V4 >G i' 

Also let $\{\mathcal{H}_i\}_{i\ge 0}$ be a filtration with $\mathcal{H}_i := \mathcal{G}_{T^*_{2i+1}}$ for $i \ge 0$. By construction,
$B_i \in \mathcal{H}_i$ for $i \ge 1$ and, moreover, applying Lemma 4.6 as in Theorem 4.2, we
have



G2Rg(t ^ ] >C 



Hi-] 



1 - $(C) 
> > 0, for large z. 

Now Lemma A. 3 implies that, with probability one, Bi occurs infinitely 
often. This completes the proof of (4.2). 

To show (4.3), it suffices to look at the subsequence of even epochs. If
$R_G(2n) = n$, then $M_G(2n) = M_A(2n)$. Suppose that $R_G(2n) = n + K_n > n$
(swap R with S in the contrary case). This leads to

\[
\begin{aligned}
M_G(2n) - M_A(2n) &\le n(\Gamma_R[n + K_n] - \Gamma_R[n]) \\
&\quad + (N_S(n) - ns) \cdot (N_R(n + K_n) - N_R(n)) \\
&\quad - n(\Gamma_S[n] - \Gamma_S[n - K_n]) \\
&\quad - (N_R(n) - nr) \cdot (N_S(n) - N_S(n - K_n)).
\end{aligned}
\]

All statements which follow are to be understood as holding eventually, with
probability one. By Theorem 3.3, $K_n \le B(n\log_2 n)^{1/2}$ for some constant $B$
and by Lemma A.1, the components of $(N_S(n) - ns)$ are uniformly bounded
by $n^{9/16}$. Since the components of $(N_R(n + K_n) - N_R(n))$ are bounded by
$K_n$, the second term above is of order at most $n^{9/8}$. The same holds
true for the fourth term. By Theorem 5.1 of [6] on lag sums, we have, for
$0 \le k \le B(n\log_2 n)^{1/2}$ and for some constant $C$,

\[
\begin{aligned}
n(\Gamma_R[n + k] - \Gamma_R[n]) - n(\Gamma_S[n] - \Gamma_S[n - k])
&\le n(k\mu + C\sqrt{k\log n}) - n(k\mu - C\sqrt{k\log n}) \\
&\le 2Cn^{5/4}\sqrt{\log n}\,(\log_2 n)^{1/4}.
\end{aligned}
\]

This implies that $M_G(2n) - M_A(2n)$ is bounded above by a constant multiple of
$n^{5/4}\sqrt{\log n}\,(\log_2 n)^{1/4}$. Similarly,
the difference $M_G(2n) - M_A(2n)$ can be bounded from below. Hence, the
proof. □

APPENDIX 

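The first lemma, on the rate of $\ell^2$ convergence of empirical probabilities to
the true probabilities, derives from [14] and is included here for the reader's
convenience.

Lemma A.1. Let $\{X_j\}_{j\ge 1}$ be a random sample from a discrete distribution
described by

\[
p_j := \Pr(X_1 = z_j), \quad j = 1, 2, \ldots, \quad \text{with } \sum_{j\ge 1} p_j = 1.
\]

Defining the empirical probability vector $(\hat{p}^n_j)_{j\ge 1}$, for $n \ge 1$, by

\[
\hat{p}^n_j = \left(\frac{1}{n}\right) \#\{1 \le k \le n : X_k = z_j\}, \quad j = 1, 2, \ldots,
\]

we have

\[
\limsup_{n\to\infty} \left( \frac{n}{\log_2(n)} \right) \sum_{j=1}^{\infty} (\hat{p}^n_j - p_j)^2 \le C < \infty \quad \text{a.s.}
\]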



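Proof. Without loss of generality, we assume that $z_j = j$, for $j \ge 1$.
Defining the kernel $h(\cdot, \cdot)$ by

\[
h(i, j) = I_{\{i = j\}} - (p_i + p_j) + \sum_{k\ge 1} (p_k)^2, \quad i, j = 1, 2, \ldots,
\]

we have

\[
\left( \frac{n}{\log_2(n)} \right) \sum_{j\ge 1} (\hat{p}^n_j - p_j)^2
= \left( \frac{2}{n\log_2(n)} \right) \sum_{1\le i<j\le n} h(X_i, X_j)
+ \left( \frac{1}{\log_2(n)} \right) \Bigl( 1 + \sum_{k\ge 1} p_k^2 \Bigr)
- \left( \frac{2}{\log_2(n)} \right) \Bigl( \frac{1}{n} \sum_{i=1}^{n} p_{X_i} \Bigr).
\]

It is easy to check that the first term on the right is a canonical $U$-statistic
of order 2. By [2], we have

\[
\limsup_{n\to\infty} \left( \frac{2}{n\log_2(n)} \right) \sum_{1\le i<j\le n} h(X_i, X_j) \le C < \infty
\]

for some constant $C$. The third term on the right converges to zero in the
almost sure sense by the usual SLLN. Hence, the proof. □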
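The decomposition used in the proof is an exact algebraic identity, so it can be checked numerically on a simulated sample. The sketch below assumes a finite support (the lemma allows a countable one) and multiplies the identity through by $\log_2(n)$, so that it reads $n\sum_j(\hat{p}^n_j - p_j)^2 = (2/n)\sum_{i<j} h(X_i, X_j) + (1 + \sum_k p_k^2) - (2/n)\sum_i p_{X_i}$. The function name `check_decomposition` is illustrative only.

```python
import random
from itertools import combinations


def check_decomposition(p, n, seed=0):
    """Return both sides of the U-statistic decomposition for one sample.

    p : probabilities of a finite-support distribution on labels 0..len(p)-1.
    """
    rng = random.Random(seed)
    m = len(p)
    X = rng.choices(range(m), weights=p, k=n)
    sum_p_sq = sum(q * q for q in p)

    # Left-hand side: n times the squared l2 distance of the empirical vector.
    p_hat = [X.count(j) / n for j in range(m)]
    lhs = n * sum((a - b) ** 2 for a, b in zip(p_hat, p))

    # Kernel h(i, j) = 1{i = j} - (p_i + p_j) + sum_k p_k^2.
    def h(i, j):
        return (1.0 if i == j else 0.0) - (p[i] + p[j]) + sum_p_sq

    u_sum = sum(h(X[a], X[b]) for a, b in combinations(range(n), 2))
    rhs = 2.0 * u_sum / n + (1.0 + sum_p_sq) - 2.0 * sum(p[x] for x in X) / n
    return lhs, rhs


lhs, rhs = check_decomposition([0.5, 0.3, 0.2], n=200, seed=3)
print(lhs, rhs)
```

The kernel is degenerate in the sense that $E\,h(x, X_1) = 0$ for every fixed $x$, which is what makes the first term a canonical $U$-statistic.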

The following uniform central limit theorem for martingales is a restate- 
ment of Theorem 3.7 of [5] for a noncanonical filtration [see Remark (ii) on 
page 84 following the theorem]. 

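Lemma A.2. Let $\{S_i = \sum_{j=1}^{i} X_j, \mathcal{H}_i, 1 \le i \le n\}$ be a zero-mean martingale.
Let

\[
V_i^2 = \sum_{j=1}^{i} E(X_j^2 \mid \mathcal{H}_{j-1}), \quad 1 \le i \le n,
\]

and suppose that

\[
\max_{i\le n} |X_i| \le n^{-1/2}M \quad \text{a.s.}
\]

and

\[
\Pr\bigl(|V_n^2 - 1| > 9M^2 D n^{-1/2}(\log n)^2\bigr) \le C n^{-1/4}\log n
\]

for constants $M$, $C$ and $D$ ($\ge e$). Then for $n \ge 2$,

\[
\sup_x |\Pr(S_n \le x) - \Phi(x)| \le K n^{-1/4}\log n,
\]

where $K$ is a universal function of $M$, $C$ and $D$.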

The last result is a conditional Borel-Cantelli lemma which appears as 
Theorem 2.8.5 in [17]. 

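Lemma A.3. Let $\{B_i, i \ge 1\}$ be a sequence of events and $\{\mathcal{H}_i, i \ge 1\}$ an
increasing sequence of $\sigma$-fields such that $B_i \in \mathcal{H}_i$ for each $i \ge 1$. Then

\[
\{B_i \text{ i.o.}\} = \Bigl\{ \sum_{i=1}^{\infty} \Pr(B_i \mid \mathcal{H}_{i-1}) = \infty \Bigr\},
\]

that is, $\sum_{i=1}^{\infty} \Pr(B_i \mid \mathcal{H}_{i-1}) < \infty$ implies the $B_i$ occur at most finitely often
and $\sum_{i=1}^{\infty} \Pr(B_i \mid \mathcal{H}_{i-1}) = \infty$ implies the $B_i$ occur infinitely often.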